[GPU] Use affine.linearize_index (and delinearize_index) where possible #19087

Closed

@krzysz00 krzysz00 commented Nov 8, 2024

Composing affine maps has proven too general and loses important information, such as the fact that affine_map<(s0 + s1 * 32 + ... - (s0 floorDiv 16) * 16)> really should be affine_map<(s0 mod 16 + s1 * 32 + ...)>. There are also other issues with the final IR that block low-level arithmetic optimizations.
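To make the lost information concrete, here is a minimal C++ sketch that evaluates both forms of the map and checks that they agree. The helper names (`floorDiv`, `floorMod`, `expandedForm`, `compactForm`) are made up for illustration, and the terms elided by `...` in the maps above are simply left out, so this only models the `s0` and `s1 * 32` parts.

```cpp
#include <cassert>
#include <cstdint>

// Division rounding toward negative infinity (C++ '/' truncates toward zero).
// Hypothetical helper, not an MLIR API.
constexpr int64_t floorDiv(int64_t a, int64_t b) {
  int64_t q = a / b;
  int64_t r = a % b;
  return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}

// Non-negative remainder for a positive divisor. Hypothetical helper.
constexpr int64_t floorMod(int64_t a, int64_t b) {
  int64_t r = a % b;
  return r < 0 ? r + b : r;
}

// The form produced by composition: s0 + s1 * 32 - (s0 floorDiv 16) * 16.
constexpr int64_t expandedForm(int64_t s0, int64_t s1) {
  return s0 + s1 * 32 - floorDiv(s0, 16) * 16;
}

// The form the author wants to see: s0 mod 16 + s1 * 32.
constexpr int64_t compactForm(int64_t s0, int64_t s1) {
  return floorMod(s0, 16) + s1 * 32;
}

// Checks that the two forms agree for all s0, s1 in [-limit, limit].
inline bool formsAgree(int64_t limit) {
  for (int64_t s0 = -limit; s0 <= limit; ++s0)
    for (int64_t s1 = -limit; s1 <= limit; ++s1)
      if (expandedForm(s0, s1) != compactForm(s0, s1))
        return false;
  return true;
}
```

Both forms compute the same value, but only the second one makes the `mod 16` visible to later passes.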

The affine.delinearize_index operation represents the div/mod chains needed to break a flat index into its component parts. The recently added affine.linearize_index operation is its inverse: it combines multiple indices into a flat 1D value.
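As a rough illustration of the arithmetic these two operations stand for, the following C++ sketch linearizes and delinearizes an index against a basis. It is not the MLIR implementation: `linearizeIndex` and `delinearizeIndex` are hypothetical names, the last basis entry is assumed to vary fastest, and all indices are assumed to be non-negative and in range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine per-dimension indices into one flat index.
// Assumption: the last dimension varies fastest and 0 <= indices[i] < basis[i].
inline int64_t linearizeIndex(const std::vector<int64_t> &indices,
                              const std::vector<int64_t> &basis) {
  assert(indices.size() == basis.size());
  int64_t flat = 0;
  for (std::size_t i = 0; i < indices.size(); ++i)
    flat = flat * basis[i] + indices[i];
  return flat;
}

// Split a flat index back into per-dimension indices with a div/mod chain.
// Assumption: 0 <= flat < product of the basis entries.
inline std::vector<int64_t> delinearizeIndex(int64_t flat,
                                             const std::vector<int64_t> &basis) {
  std::vector<int64_t> indices(basis.size());
  for (std::size_t i = basis.size(); i-- > 0;) {
    indices[i] = flat % basis[i];  // component for dimension i
    flat /= basis[i];              // remaining outer part
  }
  return indices;
}
```

For example, a flat thread ID could be built by linearizing (z, y, x) thread IDs against the workgroup sizes, and delinearizing that flat ID against the same basis recovers the original components.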

Another advantage of linearize/delinearize is that they allow simpler upstream canonicalizations and lead to more streamlined generated code.

This PR updates the vector distribution code and other GPU-related code that I could find to

  1. Use affine.linearize_index to construct flat thread IDs
  2. Use affine.delinearize_index in places where there was a floorDiv/mod chain.

Additionally

  1. Replace the approach to non-uniform execution that used an scf.for with a thread ID as its initial value with an scf.if, to better reflect the intended control flow and enable a linearize
  2. Plumb the subgroup size through the transfer_read and transfer_write distribution patterns to enable better reasoning about when a mod of the lane ID is or is not needed
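The loop-to-conditional change in the first item can be sketched in plain C++. This assumes the loop being replaced starts at the thread ID, steps by the thread count, and covers no more items than there are threads, so it runs at most once per thread. Those details and the function names are assumptions made for illustration, not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Loop form: start at the thread ID and step by the number of threads.
// Assumption: numItems <= numThreads, so the body executes at most once.
inline std::vector<int64_t> itemsViaLoop(int64_t tid, int64_t numThreads,
                                         int64_t numItems) {
  std::vector<int64_t> items;
  for (int64_t i = tid; i < numItems; i += numThreads)
    items.push_back(i);
  return items;
}

// Conditional form: the same work expressed as a guard on the thread ID.
inline std::vector<int64_t> itemsViaIf(int64_t tid, int64_t numItems) {
  std::vector<int64_t> items;
  if (tid < numItems)
    items.push_back(tid);
  return items;
}

// Checks that both forms select the same items for every thread.
inline bool formsMatch(int64_t numThreads, int64_t numItems) {
  for (int64_t tid = 0; tid < numThreads; ++tid)
    if (itemsViaLoop(tid, numThreads, numItems) != itemsViaIf(tid, numItems))
      return false;
  return true;
}
```

Under that assumption the two forms do the same work, and the conditional states directly that only some threads execute the body.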

@krzysz00 krzysz00 force-pushed the gpu-distribute-with-linearize branch 2 times, most recently from 819a5ae to c67c36d Compare November 12, 2024 17:03
@krzysz00 krzysz00 force-pushed the gpu-distribute-with-linearize branch from c67c36d to 8192c87 Compare November 12, 2024 19:22
@krzysz00 krzysz00 marked this pull request as ready for review November 12, 2024 19:48
@@ -143,6 +153,22 @@ LogicalResult resolveGPUMappedForallOp(RewriterBase &rewriter,
newBlockArgs);
rewriter.eraseOp(forallTerminator);
rewriter.eraseOp(forallOp);

// Step 5. Create the post-loop code that only executes on some workitems.
if (hasPostLoopTail) {
Could we move this into a separate PR?

@krzysz00

Closing in favor of #19122 so I can split the PR

@krzysz00 krzysz00 closed this Nov 12, 2024