[GPU] Use affine.linearize_index (and delinearize_index) where possible #19122
Conversation
Force-pushed acc087c to c3eabac
VectorDistribute changes LGTM
Location loc = laneId.getLoc();

auto [laneDimX, laneDimY, laneDimZ] = layout.getLaneGrid();
int64_t gridsPerSubgroup =
    llvm::divideCeil(subgroupSize, laneDimX * laneDimY * laneDimZ);
// Note: we add an extra entry to the delinearization here in case the
// vector layout requires fewer lanes than are present in the subgroup.
// Otherwise, we'd, for example, construct delinearizations with the basis
// (1, 1, 16) when there are 32 lanes, which would simplify to no
// delinearization at all. The extra leading term captures the overflow.
auto reversedLaneGrid = rewriter.create<affine::AffineDelinearizeIndexOp>(
    loc, laneId,
    ArrayRef<int64_t>{gridsPerSubgroup, laneDimZ, laneDimY, laneDimX});
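To make the comment concrete, here is a minimal sketch (value names hypothetical), assuming a subgroup of 32 lanes and a lane grid of (1, 1, 16). Then gridsPerSubgroup = ceil(32 / 16) = 2, so the basis becomes (2, 1, 1, 16), whose product covers all 32 lanes; the op can no longer fold away and %ids#3 stays laneId mod 16:

```mlir
// Basis (1, 1, 16) would simplify to no delinearization at all, since its
// product (16) would let the canonicalizer assume %lane_id < 16. The extra
// leading term 2 keeps the op, and the mod it encodes, alive.
%ids:4 = affine.delinearize_index %lane_id into (2, 1, 1, 16)
    : index, index, index, index
```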
Anything related to LayoutAttr is deprecated and to be deleted. I'm not checking whether these changes are right or wrong.
compiler/src/iree/compiler/Codegen/LLVMGPU/TransformExtensions/LLVMGPUExtensionsOps.td
// CHECK-DAG: %[[IN_IDS:.+]]:2 = affine.delinearize_index %[[ID_CLAMPED]] into (4, 16)
// CHECK-DAG: %[[LHS_SLICE:.+]] = tensor.extract_slice %[[LHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[RHS_SLICE:.+]] = tensor.extract_slice %[[RHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[IN_IDS:.+]]:3 = affine.delinearize_index %[[THREAD_ID]] into (0, 4, 16)
Just wanted to understand, what does a basis value of 0 mean? Shouldn't it be 1?
1 would be canonicalized away. 0 here is being used as a "don't care" value, since the first element of the basis never actually gets used. I'm using it as a bit of a hack to force the clamping behavior this is replacing. I'd be open to arguments for putting that affine map back, extending affine.delinearize_index to let it clamp if wanted, or for having that 0 be something like ub.poison (or an actual upper bound if we can swing it).
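For illustration, a sketch of what that 0 does (value names hypothetical): only the trailing basis terms feed the computed results, so the leading entry never appears in any expression, and a 0 there merely blocks the in-bounds assumption that a leading 1 would license:

```mlir
// %ids#2 = %tid mod 16, %ids#1 = (%tid floordiv 16) mod 4,
// %ids#0 = %tid floordiv 64. The leading 0 is never used in these
// expressions; it only prevents canonicalization from assuming
// %tid < 64 and folding the mods away.
%ids:3 = affine.delinearize_index %tid into (0, 4, 16) : index, index, index
```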
I'm not sure what the answer here is. Just that 0 in a basis is very weird. I think you know this piece of code better than me, and if you think this is right, I'm ok with it, but just reading this makes no sense to me. Maybe @qedawkins or @MaheshRavishankar have a better idea.
This is not a blocking comment, just something that looked weird to me.
... Yeah, on inspection of the modern MMA code, an affine.delinearize_index clamp upstream is probably the way to go here, or some other blessed way to get an outermost mod.
... making the front basis value (which isn't actually used in any computations) optional is a thought.
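For comparison, a sketch (value names hypothetical) of the explicit outermost-mod alternative floated in this thread: clamp first, then delinearize with an ordinary in-bounds basis, so no "don't care" 0 term is needed:

```mlir
// Clamp the thread id to the 64 indices the basis actually covers,
// then split it into the (4, 16) grid coordinates.
%c64 = arith.constant 64 : index
%clamped = arith.remui %thread_id, %c64 : index
%ids:2 = affine.delinearize_index %clamped into (4, 16) : index, index
```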
(re-drafted since comments above revealed an upstream change I need to make)
Force-pushed c3eabac to 5207e94
The changes outside of GPUDistributionPatterns LGTM. Re: the clamping behavior with delinearize_index, we should just go with whatever folds/composes best; otherwise I don't have any opinion.
Holding approval until llvm changes are landed
@@ -189,30 +189,29 @@ SmallVector<linalg::ProcInfo> getIds(OpBuilder &b, Location loc,
                                     ArrayRef<Range> parallelLoopRanges,
This file is only used on legacy paths today and (I think) should be deprecated/rewritten. Changing this pass might involve updating a large number of tests so if you want you can drop the changes here, they shouldn't be on the critical paths. Definitely would still stamp though :)
flatId = rewriter
             .create<affine::AffineDelinearizeIndexOp>(
                 loc, flatId,
                 ArrayRef<int64_t>{flatWorkgroupSize / subgroupSize,
Can we add something like
if (flatWorkgroupSize % subgroupSize != 0) {
  forallOp->emitOpError("found warp mapped forall with non-multiple workgroup size");
  return failure();
}
I just realized this could silently pass through even if they aren't a multiple, and I'm fairly certain we don't handle such a case properly.
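As a concrete instance of the concern (value names hypothetical): with flatWorkgroupSize = 48 and subgroupSize = 32, integer division gives 48 / 32 = 1, so the basis would be (1, 32), and thread id 40 would delinearize to a subgroup id the basis claims cannot exist:

```mlir
// For %flat_id = 40: %ids#0 = 40 floordiv 32 = 1 and %ids#1 = 40 mod 32 = 8,
// but the basis bounds %ids#0 by 1, so anything that trusts the basis
// may now miscompile.
%ids:2 = affine.delinearize_index %flat_id into (1, 32) : index, index
```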
Done
@@ -34,28 +34,28 @@ hal.executable @transpose_dispatch_0 {
// CHECK-LABEL: hal.executable public @transpose_dispatch_0
// CHECK-DAG: %[[CST:.*]] = arith.constant 0.000000e+00 : f32
// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
// CHECK-DAG: %[[D0:.*]] = gpu.thread_id x
Sorry you ended up updating this test, I've been meaning to delete this + deprecate the transpose pipeline for a while. Might be worth syncing up at some point so we can align on which pipelines/passes are still in use and which are planned to be deprecated.
Force-pushed 5328767 to 5a8fa83
nice, LGTM!
Force-pushed 5a8fa83 to 291f570
There have been issues with the composition of affine maps being too general and losing important information, like the fact that affine_map<(s0 + s1 * 32 + ... - (s0 floordiv 16) * 16)> really should be affine_map<(s0 mod 16 + s1 * 32 + ...)>, and other issues with the resulting IR that block low-level arithmetic optimizations.

The affine.delinearize_index operation represents the div/mod chains needed to break a flat index into its component parts. The recently added affine.linearize_index operation is its inverse, combining multiple indices into a flat 1D value. Both ops are sketched after the list below.

Another advantage of linearize/delinearize is that their simpler upstream canonicalizations lead to more streamlined generated code.

This PR updates the vector distribution code and other GPU-related code that I could find to:
1. Use affine.linearize_index to construct flat thread IDs.
2. Use affine.delinearize_index in places where there was a floordiv/mod chain.
3. Plumb the subgroup size through the transfer_read and transfer_write distribution patterns to enable better reasoning about when you do/don't need to take a mod of the lane ID.
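A minimal sketch of the two ops (the value names and the 4x8x32 grid are hypothetical): linearize_index builds a flat thread ID from per-dimension IDs, and delinearize_index splits a flat ID back into components. Because both ops carry their bounds explicitly, matching pairs can cancel under upstream canonicalization instead of leaving behind an opaque div/mod affine map:

```mlir
// Flatten (z, y, x) thread ids over a 4x8x32 grid ...
%flat = affine.linearize_index [%tz, %ty, %tx] by (4, 8, 32) : index
// ... and recover them; with matching bases this round-trip can fold away.
%ids:3 = affine.delinearize_index %flat into (4, 8, 32) : index, index, index
```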