[GPU] Use affine.linearize_index (and delinearize_index) where possible #19122
Conversation
Force-pushed acc087c to c3eabac
VectorDistribute changes LGTM
Location loc = laneId.getLoc();

auto [laneDimX, laneDimY, laneDimZ] = layout.getLaneGrid();
int64_t gridsPerSubgroup =
    llvm::divideCeil(subgroupSize, laneDimX * laneDimY * laneDimZ);
// Note: we add an extra entry to the delinearization here in case the
// vector layout requires fewer lanes than are present in the subgroup.
// Otherwise, we'd, for example, construct delinearizations with the basis
// (1, 1, 16) when there are 32 lanes, which would simplify to no
// delinearization at all. The extra leading term captures the overflow.
auto reversedLaneGrid = rewriter.create<affine::AffineDelinearizeIndexOp>(
    loc, laneId,
    ArrayRef<int64_t>{gridsPerSubgroup, laneDimZ, laneDimY, laneDimX});
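To make the comment concrete, here is a minimal sketch (value names hypothetical), assuming a subgroup of 32 lanes and a lane grid of (1, 1, 16). Then gridsPerSubgroup = ceil(32 / 16) = 2, so the basis becomes (2, 1, 1, 16), whose product covers all 32 lanes; the op can no longer fold away and %ids#3 stays laneId mod 16:

```mlir
// Basis (1, 1, 16) would simplify to no delinearization at all, since its
// product (16) would let the canonicalizer assume %lane_id < 16. The extra
// leading term 2 keeps the op, and the mod it encodes, alive.
%ids:4 = affine.delinearize_index %lane_id into (2, 1, 1, 16)
    : index, index, index, index
```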
Anything related to LayoutAttr is deprecated and to be deleted. I'm not checking whether these changes are right or wrong.
compiler/src/iree/compiler/Codegen/LLVMGPU/TransformExtensions/LLVMGPUExtensionsOps.td
// CHECK-DAG: %[[IN_IDS:.+]]:2 = affine.delinearize_index %[[ID_CLAMPED]] into (4, 16)
// CHECK-DAG: %[[LHS_SLICE:.+]] = tensor.extract_slice %[[LHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[RHS_SLICE:.+]] = tensor.extract_slice %[[RHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[IN_IDS:.+]]:3 = affine.delinearize_index %[[THREAD_ID]] into (0, 4, 16)
Just wanted to understand, what does a basis value of 0 mean? Shouldn't it be 1?
1 would be canonicalized away. 0 here is being used as a "don't care" value, since the first element of the basis never actually gets used. I'm using it as a bit of a hack to force the clamping behavior this is replacing. I'd be open to arguments for putting that affine map back, extending affine.delinearize_index to let it clamp if wanted, or for having that 0 be something like ub.poison (or an actual upper bound if we can swing it).
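For illustration, a sketch of what that 0 does (value names hypothetical): only the trailing basis terms feed the computed results, so the leading entry never appears in any expression, and a 0 there merely blocks the in-bounds assumption that a leading 1 would license:

```mlir
// %ids#2 = %tid mod 16, %ids#1 = (%tid floordiv 16) mod 4,
// %ids#0 = %tid floordiv 64. The leading 0 is never used in these
// expressions; it only prevents canonicalization from assuming
// %tid < 64 and folding the mods away.
%ids:3 = affine.delinearize_index %tid into (0, 4, 16) : index, index, index
```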
I'm not sure what the answer here is. Just that 0 in a basis is very weird. I think you know this piece of code better than me, and if you think this is right, I'm ok with it, but just reading this makes no sense to me. Maybe @qedawkins or @MaheshRavishankar have a better idea.
This is not a blocking comment, just something that looked weird to me.
... Yeah, on inspection of the modern MMA code, an affine.delinearize_index clamp upstream is probably the way to go here, or some other blessed way to get an outermost mod.
... making the front basis value (which isn't actually used in any computations) optional is a thought.
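For comparison, a sketch (value names hypothetical) of the explicit outermost-mod alternative floated in this thread: clamp first, then delinearize with an ordinary in-bounds basis, so no "don't care" 0 term is needed:

```mlir
// Clamp the thread id to the 64 indices the basis actually covers,
// then split it into the (4, 16) grid coordinates.
%c64 = arith.constant 64 : index
%clamped = arith.remui %thread_id, %c64 : index
%ids:2 = affine.delinearize_index %clamped into (4, 16) : index, index
```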
(re-drafted since comments above revealed an upstream change I need to make)
Force-pushed c3eabac to 5207e94
The changes outside of GPUDistributionPatterns LGTM. Re: the clamping behavior with delinearize_index, we should just go with whatever folds/composes best; otherwise I don't have any opinion.
Holding approval until llvm changes are landed
@@ -189,30 +189,29 @@ SmallVector<linalg::ProcInfo> getIds(OpBuilder &b, Location loc,
                                     ArrayRef<Range> parallelLoopRanges,
This file is only used on legacy paths today and (I think) should be deprecated/rewritten. Changing this pass might involve updating a large number of tests so if you want you can drop the changes here, they shouldn't be on the critical paths. Definitely would still stamp though :)
flatId = rewriter
             .create<affine::AffineDelinearizeIndexOp>(
                 loc, flatId,
                 ArrayRef<int64_t>{flatWorkgroupSize / subgroupSize,
Can we add something like
if (flatWorkgroupSize % subgroupSize != 0) {
  forallOp->emitOpError("found warp mapped forall with non-multiple workgroup size");
  return failure();
}
I just realized this could silently pass through even if they aren't a multiple, and I'm fairly certain we don't handle such a case properly.
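As a concrete instance of the concern (value names hypothetical): with flatWorkgroupSize = 48 and subgroupSize = 32, integer division gives 48 / 32 = 1, so the basis would be (1, 32), and thread id 40 would delinearize to a subgroup id the basis claims cannot exist:

```mlir
// For %flat_id = 40: %ids#0 = 40 floordiv 32 = 1 and %ids#1 = 40 mod 32 = 8,
// but the basis bounds %ids#0 by 1, so anything that trusts the basis
// may now miscompile.
%ids:2 = affine.delinearize_index %flat_id into (1, 32) : index, index
```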
Done
@@ -34,28 +34,28 @@ hal.executable @transpose_dispatch_0 {
// CHECK-LABEL: hal.executable public @transpose_dispatch_0
// CHECK-DAG: %[[CST:.*]] = arith.constant 0.000000e+00 : f32
// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
// CHECK-DAG: %[[D0:.*]] = gpu.thread_id x
Sorry you ended up updating this test, I've been meaning to delete this + deprecate the transpose pipeline for a while. Might be worth syncing up at some point so we can align on which pipelines/passes are still in use and which are planned to be deprecated.
Force-pushed 5328767 to 5a8fa83
nice, LGTM!
Force-pushed 5a8fa83 to 291f570
There have been issues with the composition of affine maps being too general and losing important information, like the fact that affine_map<(s0 + s1 * 32 + ... - (s0 floordiv 16) * 16)> really should be affine_map<(s0 mod 16 + s1 * 32 + ...)>, and other issues with the resulting IR that block low-level arithmetic optimizations.

The affine.delinearize_index operation represents the div/mod chains needed to break a flat index into its component parts. The recently added affine.linearize_index operation is its inverse, combining multiple indices into a flat 1D value. Both ops are sketched after the list below.

Another advantage of linearize/delinearize is that their simpler upstream canonicalizations lead to more streamlined generated code.

This PR updates the vector distribution code and other GPU-related code that I could find to:
1. Use affine.linearize_index to construct flat thread IDs.
2. Use affine.delinearize_index in places where there was a floordiv/mod chain.
3. Plumb the subgroup size through the transfer_read and transfer_write distribution patterns to enable better reasoning about when you do/don't need to take a mod of the lane ID.
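A minimal sketch of the two ops (the value names and the 4x8x32 grid are hypothetical): linearize_index builds a flat thread ID from per-dimension IDs, and delinearize_index splits a flat ID back into components. Because both ops carry their bounds explicitly, matching pairs can cancel under upstream canonicalization instead of leaving behind an opaque div/mod affine map:

```mlir
// Flatten (z, y, x) thread ids over a 4x8x32 grid ...
%flat = affine.linearize_index [%tz, %ty, %tx] by (4, 8, 32) : index
// ... and recover them; with matching bases this round-trip can fold away.
%ids:3 = affine.delinearize_index %flat into (4, 8, 32) : index, index, index
```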