Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU data tiling on RDNA3 #18980

Merged
merged 2 commits into from
Nov 7, 2024
Merged

GPU data tiling on RDNA3 #18980

merged 2 commits into from
Nov 7, 2024

Conversation

bjacob
Copy link
Contributor

@bjacob bjacob commented Oct 31, 2024

A few things were needed:

  • Populate required fields in KnownTargets.cpp.
  • Support the case where the intrinsic vector operand size is greater than the load instruction size (here it is 16xf16 = 256 bit).
  • buildMmaOperation creates vector.insert_strided_slice to insert the new accumulator vectors into the accumulator tile. In doing so, it was relying on vector.insert_strided_slice implicit expand-shape semantics, in ways that worked for the shapes we had seen in CDNA3 but not here. Solved by explicitly expanding shapes with vector.shape_cast ops.
  • In thread-distribution code (populateOperandXxx), we needed to account for the nuance between two distinct thread grids: "layout" vs "distribution". In the case of RDNA3, there is a distribution-only dimension that isn't reflected in the layout-centric TileSwizzle's.
  • On RDNA3, some float arithmetic is strongly non-IEEE754-compliant: even exactly-representable small integral values, on which float arithmetic should be exact, have epsilon numerical errors! Addressed by tolerance.
  • Fix a bug: the doubly-nullable type std::optional<IREE::GPU::DataTiledMMAAttr> tricked us, change to IREE::GPU::DataTiledMMAAttr.

@bjacob bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from 06542f5 to aec9f1b Compare November 4, 2024 18:55
@bjacob bjacob changed the base branch from users/bjacob/cdna3-tests to main November 4, 2024 18:56
@bjacob bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from aec9f1b to 9351397 Compare November 4, 2024 19:22
@bjacob bjacob marked this pull request as ready for review November 4, 2024 21:18
@bjacob bjacob requested a review from Groverkss November 4, 2024 21:19
@bjacob bjacob requested a review from kuhar November 4, 2024 21:31
Copy link
Contributor

@Groverkss Groverkss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool insights for the RDNA3 intrinsic. Thank you for doing this! I mainly have documentation comments, since this might not be obvious to someone who does not understand the intrinsic data reuse on threads very well.

// only to the layout dims, and do not reflect a possible additional
// thread-distribution-only dimension present on some architectures (RDNA3).
// When such an extra dim exists, multiple threads are reading the same data.
// So we need to distinguish layoutThreadSizes vs. distributionThreadSizes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could put an example of how this looks like for an intrinsic? This might not be obvious to someone who hasn't stared at RDNA3 intrinsics enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I just expanded the general explanation, because I still felt that this was something that we are doing from first principles and can explain a such. I felt that to really work out an example would take more space than we could spend here.

Comment on lines 980 to 986
// Most of the rest of this function is the computation of the offsets.
// The basic idea is to delinearize the threadId over the basis of
// cross-thread dimensions. The main subtlety is that the TileSwizzles refer
// only to the layout dims, and do not reflect a possible additional
// thread-distribution-only dimension present on some architectures (RDNA3).
// When such an extra dim exists, multiple threads are reading the same data.
// So we need to distinguish layoutThreadSizes vs. distributionThreadSizes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we have an intrinsic that takes let's say 32 elements, we are essentially broadcasting it to size 64, and distributing it. And when doing anything else, we ignore the broadcasted dimension, to get the same data on different threads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Comment on lines 1032 to 1036
// Erase the delinearized index that corresponds to the extra distribution
// dimension that we had inserted above.
tileOffsets.erase(tileOffsets.begin() + distributionOnlyDimIdx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention that this essentially leads to threads reading the same data, accomplishing what we wanted for the intrinsic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Benoit Jacob <[email protected]>
@bjacob bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from ce464c9 to 04bc2d2 Compare November 7, 2024 21:11
@bjacob bjacob merged commit 8e5f218 into main Nov 7, 2024
39 checks passed
@bjacob bjacob deleted the users/bjacob/rdna3-data-tiling branch November 7, 2024 21:40
bjacob added a commit that referenced this pull request Nov 16, 2024
Working on #18980 let me spend
quality time with e2e matmul tests and suggested some changes.

The main change is to simplify the printing of numerical values to
always use high precision, meaning print all significant digits of
floating point values.

Since our tests generate small integral values and the intent is
generally to be testing mostly the exact arithmetic that happens on
small integral values, in most cases this doesn't make any difference.

But I found that RDNA3 float arithmetic produces non-exact results even
on those values. As a result, I got values like 1+epsilon where 1 was
expected, causing a test to fail (since we didn't know we needed to opt
out from requiring exact results) and the test output cryptically
printed both values as "1".

The other change is to more consistently print the same number of rows
and columns regardless of whether we are at the start or in the middle
of a dimension, and to have that number be what we call "context"
(before, it was "2 * context").

Also a seasonal emoji change.

Signed-off-by: Benoit Jacob <[email protected]>
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
A few things were needed:
* Populate required fields in KnownTargets.cpp.
* Support the case where the intrinsic vector operand size is greater
than the load instruction size (here it is 16xf16 = 256 bit).
* `buildMmaOperation` creates `vector.insert_strided_slice` to insert
the new accumulator vectors into the accumulator tile. In doing so, it
was relying on `vector.insert_strided_slice` implicit expand-shape
semantics, in ways that worked for the shapes we had seen in CDNA3 but
not here. Solved by explicitly expanding shapes with `vector.shape_cast`
ops.
* In thread-distribution code (populateOperandXxx), we needed to account
for the nuance between two distinct thread grids: "layout" vs
"distribution". In the case of RDNA3, there is a distribution-only
dimension that isn't reflected in the layout-centric TileSwizzle's.
* On RDNA3, some float arithmetic is strongly non-IEEE754-compliant:
even exactly-representable small integral values, on which float
arithmetic should be exact, have epsilon numerical errors! Addressed by
tolerance.
* Fix a bug: the doubly-nullable type
`std::optional<IREE::GPU::DataTiledMMAAttr>` tricked us, change to
`IREE::GPU::DataTiledMMAAttr`.

---------

Signed-off-by: Benoit Jacob <[email protected]>
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
Working on iree-org#18980 let me spend
quality time with e2e matmul tests and suggested some changes.

The main change is to simplify the printing of numerical values to
always use high precision, meaning print all significant digits of
floating point values.

Since our tests generate small integral values and the intent is
generally to be testing mostly the exact arithmetic that happens on
small integral values, in most cases this doesn't make any difference.

But I found that RDNA3 float arithmetic produces non-exact results even
on those values. As a result, I got values like 1+epsilon where 1 was
expected, causing a test to fail (since we didn't know we needed to opt
out from requiring exact results) and the test output cryptically
printed both values as "1".

The other change is to more consistently print the same number of rows
and columns regardless of whether we are at the start or in the middle
of a dimension, and to have that number be what we call "context"
(before, it was "2 * context").

Also a seasonal emoji change.

Signed-off-by: Benoit Jacob <[email protected]>
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
A few things were needed:
* Populate required fields in KnownTargets.cpp.
* Support the case where the intrinsic vector operand size is greater
than the load instruction size (here it is 16xf16 = 256 bit).
* `buildMmaOperation` creates `vector.insert_strided_slice` to insert
the new accumulator vectors into the accumulator tile. In doing so, it
was relying on `vector.insert_strided_slice` implicit expand-shape
semantics, in ways that worked for the shapes we had seen in CDNA3 but
not here. Solved by explicitly expanding shapes with `vector.shape_cast`
ops.
* In thread-distribution code (populateOperandXxx), we needed to account
for the nuance between two distinct thread grids: "layout" vs
"distribution". In the case of RDNA3, there is a distribution-only
dimension that isn't reflected in the layout-centric TileSwizzle's.
* On RDNA3, some float arithmetic is strongly non-IEEE754-compliant:
even exactly-representable small integral values, on which float
arithmetic should be exact, have epsilon numerical errors! Addressed by
tolerance.
* Fix a bug: the doubly-nullable type
`std::optional<IREE::GPU::DataTiledMMAAttr>` tricked us, change to
`IREE::GPU::DataTiledMMAAttr`.

---------

Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Giacomo Serafini <[email protected]>
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
Working on iree-org#18980 let me spend
quality time with e2e matmul tests and suggested some changes.

The main change is to simplify the printing of numerical values to
always use high precision, meaning print all significant digits of
floating point values.

Since our tests generate small integral values and the intent is
generally to be testing mostly the exact arithmetic that happens on
small integral values, in most cases this doesn't make any difference.

But I found that RDNA3 float arithmetic produces non-exact results even
on those values. As a result, I got values like 1+epsilon where 1 was
expected, causing a test to fail (since we didn't know we needed to opt
out from requiring exact results) and the test output cryptically
printed both values as "1".

The other change is to more consistently print the same number of rows
and columns regardless of whether we are at the start or in the middle
of a dimension, and to have that number be what we call "context"
(before, it was "2 * context").

Also a seasonal emoji change.

Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Giacomo Serafini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants