
[GPU] Use affine.linearize_index (and delinearize_index) where possible #19122

Merged: 1 commit into main from users/krzysz00/gpu-distribute-with-linearize on Nov 26, 2024

Conversation

krzysz00 (Contributor, PR author):

There have been issues with the composition of affine maps being too general and losing important information, like the fact that affine_map<(s0 + s1 * 32 + ... - (s0 floorDiv 16) * 16)> really should be affine_map<(s0 mod 16 + s1 * 32 + ...)>, and other issues with the resulting IR that block low-level arithmetic optimizations.

The affine.delinearize_index operation represents the div/mod chains needed to break a flat index into its component parts. A recently added affine.linearize_index operation is its inverse - combining multiple indices into a flat 1D value.
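As a rough sketch of what the two ops express (schematic IR with placeholder values and trailing types elided):

%ids:2 = affine.delinearize_index %flat into (4, 16)
// %ids#0 = %flat floordiv 16, %ids#1 = %flat mod 16
%flat_again = affine.linearize_index [%i, %j] by (4, 16)
// %flat_again = %i * 16 + %j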

Another advantage of linearize/delinearize is that they have simpler upstream canonicalizations, which lead to more streamlined generated code.
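For example, one would expect a delinearize/linearize round trip over the same basis to fold away entirely, which is much harder to recover once the same arithmetic has been flattened into a composed affine map (schematic sketch, types elided):

%ids:2 = affine.delinearize_index %flat into (4, 16)
%same = affine.linearize_index [%ids#0, %ids#1] by (4, 16)
// expected to canonicalize so that %same is simply replaced by %flat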

This PR updates the vector distribution code and other GPU-related code that I could find to

  1. Use affine.linearize_index to construct flat thread IDs (see the sketch after this list).
  2. Use affine.delinearize_index in places where there was a floorDiv/mod chain.
  3. Plumb the subgroup size through the transfer_read and transfer_write distribution patterns to enable better reasoning about when you do/don't need to take a mod of the lane ID.
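A minimal sketch of item 1, assuming a hypothetical 8x4x2 workgroup (ordering and sizes chosen purely for illustration; types elided):

%tid_x = gpu.thread_id x
%tid_y = gpu.thread_id y
%tid_z = gpu.thread_id z
%flat_tid = affine.linearize_index [%tid_z, %tid_y, %tid_x] by (2, 4, 8)
// %flat_tid = (%tid_z * 4 + %tid_y) * 8 + %tid_x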

@Groverkss (Contributor) left a comment:

VectorDistribute changes LGTM

Comment on lines 37 to 50
Location loc = laneId.getLoc();

auto [laneDimX, laneDimY, laneDimZ] = layout.getLaneGrid();
int64_t gridsPerSubgroup =
    llvm::divideCeil(subgroupSize, laneDimX * laneDimY * laneDimZ);
// Note: we add an extra entry to the delinearization to handle the case where
// the vector layout requires fewer lanes than are present in the subgroup.
// Without it we would, for example, construct delinearizations with the basis
// (1, 1, 16) when there are 32 lanes, which would simplify to no
// delinearization at all. The extra leading term captures the overflow.
auto reversedLaneGrid = rewriter.create<affine::AffineDelinearizeIndexOp>(
    loc, laneId,
    ArrayRef<int64_t>{gridsPerSubgroup, laneDimZ, laneDimY, laneDimX});
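For illustration, with a hypothetical subgroupSize of 64 and a lane grid of (x=16, y=2, z=1), gridsPerSubgroup is ceil(64 / 32) = 2 and the builder call above would produce roughly (types elided):

%lane:4 = affine.delinearize_index %lane_id into (2, 1, 2, 16)
// %lane#3 = lane_id mod 16 (x), %lane#2 = (lane_id floordiv 16) mod 2 (y),
// %lane#1 = (lane_id floordiv 32) mod 1 = 0 (z),
// %lane#0 absorbs the lanes beyond the 1x2x16 grid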
Contributor:

Anything related to LayoutAttr is deprecated and to be deleted. I'm not checking if these changes are right or wrong.

// CHECK-DAG: %[[IN_IDS:.+]]:2 = affine.delinearize_index %[[ID_CLAMPED]] into (4, 16)
// CHECK-DAG: %[[LHS_SLICE:.+]] = tensor.extract_slice %[[LHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[RHS_SLICE:.+]] = tensor.extract_slice %[[RHS]][0, 0, %[[IN_IDS]]#0, %[[IN_IDS]]#1] [1, 1, 1, 1] [1, 1, 1, 1]
// CHECK-DAG: %[[IN_IDS:.+]]:3 = affine.delinearize_index %[[THREAD_ID]] into (0, 4, 16)
Contributor:

Just wanted to understand, what does a basis value of 0 mean? Shouldn't it be 1?

krzysz00 (Contributor, PR author):

1 would be canonicalized away

0 here is being used as a "don't care" value, since the first element of the basis never actually gets used. I'm using it as a bit of a hack to force the clamping behavior this is replacing.

I'd be open to arguments for putting that affine map back, extending affine.delinearize_index to let it clamp if wanted, or for having that 0 be something like ub.poison (or an actual upper bound if we can swing it)
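Schematically, for the check line in question, the leading basis entry only documents a bound that never enters the arithmetic (sketch, types elided):

%ids:3 = affine.delinearize_index %thread_id into (0, 4, 16)
// %ids#2 = %thread_id mod 16, %ids#1 = (%thread_id floordiv 16) mod 4,
// %ids#0 = %thread_id floordiv 64; the leading entry is never used in the
// computation, so the 0 acts purely as a placeholder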

Contributor:

I'm not sure what the answer here is. Just that 0 in a basis is very weird. I think you know this piece of code better than me, and if you think this is right, I'm ok with it, but just reading this makes no sense to me. Maybe @qedawkins or @MaheshRavishankar have a better idea.

@Groverkss (Contributor), Nov 12, 2024:

This is not a blocking comment, just something that looked weird to me.

krzysz00 (Contributor, PR author):

... Yeah, on inspection of the modern MMA code, affine.delinearize_index clamp upstream is probably the way to go here. Or some other blessed way to get an outermost mod.

... making the front basis value (which isn't actually used in any computations) optional is a thought.

@krzysz00 krzysz00 marked this pull request as draft November 13, 2024 16:58
krzysz00 (Contributor, PR author):

(re-drafted since comments above revealed an upstream change I need to make)

@krzysz00 krzysz00 force-pushed the users/krzysz00/gpu-distribute-with-linearize branch from c3eabac to 5207e94 on November 13, 2024 at 21:15
@krzysz00 krzysz00 marked this pull request as ready for review November 14, 2024 02:31
@qedawkins (Contributor) left a comment:

The changes outside of GPUDistributionPatterns LGTM. Re: the clamping behavior with delinearize_index, we should just go with whatever folds/composes best, otherwise I don't have any opinion.

Holding approval until the LLVM changes have landed.

@@ -189,30 +189,29 @@ SmallVector<linalg::ProcInfo> getIds(OpBuilder &b, Location loc,
ArrayRef<Range> parallelLoopRanges,
Contributor:

This file is only used on legacy paths today and (I think) should be deprecated/rewritten. Changing this pass might involve updating a large number of tests, so if you want you can drop the changes here; they shouldn't be on the critical path. Definitely would still stamp though :)

flatId = rewriter
             .create<affine::AffineDelinearizeIndexOp>(
                 loc, flatId,
                 ArrayRef<int64_t>{flatWorkgroupSize / subgroupSize,
Contributor:

Can we add something like

if (flatWorkgroupSize % subgroupSize != 0) {
  forallOp->emitOpError("found warp mapped forall with non-multiple workgroup size");
  return failure();
}

I just realized this could silently pass through even if they aren't a multiple, and I'm fairly certain we don't handle such a case properly.

krzysz00 (Contributor, PR author):

Done

@@ -34,28 +34,28 @@ hal.executable @transpose_dispatch_0 {
// CHECK-LABEL: hal.executable public @transpose_dispatch_0
// CHECK-DAG: %[[CST:.*]] = arith.constant 0.000000e+00 : f32
// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
// CHECK-DAG: %[[D0:.*]] = gpu.thread_id x
Contributor:

Sorry you ended up updating this test; I've been meaning to delete it and deprecate the transpose pipeline for a while. Might be worth syncing up at some point so we can align on which pipelines/passes are still in use and which are planned to be deprecated.

@krzysz00 krzysz00 force-pushed the users/krzysz00/gpu-distribute-with-linearize branch 2 times, most recently from 5328767 to 5a8fa83 on November 21, 2024 at 20:54
@qedawkins (Contributor) left a comment:

nice, LGTM!

@krzysz00 krzysz00 force-pushed the users/krzysz00/gpu-distribute-with-linearize branch from 5a8fa83 to 291f570 on November 26, 2024 at 19:24
@krzysz00 krzysz00 merged commit 031accb into main Nov 26, 2024
40 checks passed
@krzysz00 krzysz00 deleted the users/krzysz00/gpu-distribute-with-linearize branch November 26, 2024 20:38
Groverkss pushed a commit to Groverkss/iree that referenced this pull request on Dec 1, 2024.
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request on Dec 4, 2024.