GPU data tiling on RDNA3 #18980

bjacob · 2024-10-31T20:34:11Z

A few things were needed:

* Populate required fields in `KnownTargets.cpp`.
* Support the case where the intrinsic vector operand size is greater than the load instruction size (here it is 16xf16 = 256 bit).
* `buildMmaOperation` creates `vector.insert_strided_slice` ops to insert the new accumulator vectors into the accumulator tile. In doing so, it was relying on the implicit expand-shape semantics of `vector.insert_strided_slice`, in ways that worked for the shapes we had seen on CDNA3 but not here. Solved by explicitly expanding shapes with `vector.shape_cast` ops.
* In thread-distribution code (populateOperandXxx), we needed to account for the nuance between two distinct thread grids: "layout" vs "distribution". In the case of RDNA3, there is a distribution-only dimension that isn't reflected in the layout-centric TileSwizzles.
* On RDNA3, some float arithmetic is strongly non-IEEE754-compliant: even exactly-representable small integral values, on which float arithmetic should be exact, have epsilon numerical errors! Addressed by comparing with a tolerance.
* Fix a bug: the doubly-nullable type `std::optional<IREE::GPU::DataTiledMMAAttr>` tricked us; changed to plain `IREE::GPU::DataTiledMMAAttr`.
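
To illustrate the last item, here is a minimal sketch of how a "doubly-nullable" type can mislead. It does not reproduce the actual bug in this PR. `FakeAttr` is a hypothetical stand-in for an attribute handle that is already nullable on its own, and `isUsableWrapped` / `isUsablePlain` are made-up helper names.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for an attribute handle: a thin wrapper around a
// pointer that can itself be null and is testable in a boolean context.
struct FakeAttr {
  const int *impl = nullptr;
  explicit operator bool() const { return impl != nullptr; }
};

// Pitfall: std::optional<FakeAttr> has two distinct "empty" states,
// std::nullopt and an engaged optional holding a null handle.
// This check only looks at the outer one.
inline bool isUsableWrapped(const std::optional<FakeAttr> &attr) {
  return attr.has_value();
}

// With the plain nullable type there is a single null state to check.
inline bool isUsablePlain(FakeAttr attr) { return static_cast<bool>(attr); }
```

An engaged optional holding a null `FakeAttr` passes `isUsableWrapped` but fails `isUsablePlain`, which is the kind of confusion that dropping the `std::optional` wrapper avoids.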

Groverkss

Very cool insights for the RDNA3 intrinsic. Thank you for doing this! I mainly have documentation comments, since this might not be obvious to someone who does not understand the intrinsic data reuse on threads very well.

compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt

Groverkss · 2024-11-07T17:33:56Z

compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp

+  // only to the layout dims, and do not reflect a possible additional
+  // thread-distribution-only dimension present on some architectures (RDNA3).
+  // When such an extra dim exists, multiple threads are reading the same data.
+  // So we need to distinguish layoutThreadSizes vs. distributionThreadSizes.


Do you think you could put an example of what this looks like for an intrinsic? This might not be obvious to someone who hasn't stared at RDNA3 intrinsics enough.

Actually, I just expanded the general explanation, because I still felt that this is something we are doing from first principles and can explain as such. I felt that really working out an example would take more space than we could spend here.

Groverkss · 2024-11-07T17:38:00Z

compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp

+  // Most of the rest of this function is the computation of the offsets.
+  // The basic idea is to delinearize the threadId over the basis of
+  // cross-thread dimensions. The main subtlety is that the TileSwizzles refer
+  // only to the layout dims, and do not reflect a possible additional
+  // thread-distribution-only dimension present on some architectures (RDNA3).
+  // When such an extra dim exists, multiple threads are reading the same data.
+  // So we need to distinguish layoutThreadSizes vs. distributionThreadSizes.


So if we have an intrinsic that takes, let's say, 32 elements, we are essentially broadcasting it to size 64 and distributing it. And when doing anything else, we ignore the broadcasted dimension, so that different threads get the same data.

Groverkss · 2024-11-07T17:38:32Z

compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp

+    // Erase the delinearized index that corresponds to the extra distribution
+    // dimension that we had inserted above.
+    tileOffsets.erase(tileOffsets.begin() + distributionOnlyDimIdx);


Can you mention that this essentially leads to threads reading the same data, accomplishing what we wanted for the intrinsic?
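
The delinearize-then-erase idea discussed in these comments can be sketched in plain C++ as follows. This is not the code in `IREEGPUAttrs.cpp`: `delinearize` and `layoutIndicesForThread` are hypothetical helpers, and the sizes used in the usage note are made up rather than taken from a real RDNA3 intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits a linear index into per-dimension indices over `basis`
// (outermost dimension first, innermost dimension varying fastest).
inline std::vector<int64_t> delinearize(int64_t linearId,
                                        const std::vector<int64_t> &basis) {
  std::vector<int64_t> indices(basis.size());
  for (std::size_t i = basis.size(); i-- > 0;) {
    assert(basis[i] > 0 && "basis sizes must be positive");
    indices[i] = linearId % basis[i];
    linearId /= basis[i];
  }
  return indices;
}

// Hypothetical helper: delinearize the thread ID over the distribution
// sizes (which include the distribution-only dimension), then erase the
// index of that extra dimension so that only layout indices remain.
inline std::vector<int64_t>
layoutIndicesForThread(int64_t threadId,
                       const std::vector<int64_t> &distributionThreadSizes,
                       std::size_t distributionOnlyDimIdx) {
  assert(distributionOnlyDimIdx < distributionThreadSizes.size());
  std::vector<int64_t> indices =
      delinearize(threadId, distributionThreadSizes);
  indices.erase(indices.begin() +
                static_cast<std::ptrdiff_t>(distributionOnlyDimIdx));
  return indices;
}
```

With illustrative distribution sizes `{2, 16}` and the distribution-only dimension at index 0, threads 3 and 19 delinearize to `{0, 3}` and `{1, 3}`. After the erase, both are left with the layout index `{3}`, so they address the same data.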

Signed-off-by: Benoit Jacob <[email protected]>

Working on #18980 let me spend quality time with e2e matmul tests and suggested some changes.

The main change is to simplify the printing of numerical values to always use high precision, meaning print all significant digits of floating point values. Since our tests generate small integral values and the intent is generally to be testing mostly the exact arithmetic that happens on small integral values, in most cases this doesn't make any difference. But I found that RDNA3 float arithmetic produces non-exact results even on those values. As a result, I got values like 1+epsilon where 1 was expected, causing a test to fail (since we didn't know we needed to opt out from requiring exact results) and the test output cryptically printed both values as "1".

The other change is to more consistently print the same number of rows and columns regardless of whether we are at the start or in the middle of a dimension, and to have that number be what we call "context" (before, it was "2 * context").

Also a seasonal emoji change.

Signed-off-by: Benoit Jacob <[email protected]>
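
The printing problem described in this commit message can be reproduced with standard C++ streams. The sketch below is not the e2e matmul test code. `formatDefault`, `formatFullPrecision`, and `almostEqual` are hypothetical helpers, and the absolute-difference check is only one possible form of tolerance.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <sstream>
#include <string>

// Formats a float with the stream's default precision (6 significant digits).
inline std::string formatDefault(float value) {
  std::ostringstream os;
  os << value;
  return os.str();
}

// Formats a float with max_digits10 significant digits, which is enough
// for two distinct float values to always print differently.
inline std::string formatFullPrecision(float value) {
  std::ostringstream os;
  os.precision(std::numeric_limits<float>::max_digits10);
  os << value;
  return os.str();
}

// Assumed form of a tolerance check: absolute difference within a bound.
inline bool almostEqual(float actual, float expected, float tolerance) {
  assert(tolerance >= 0.0f && "tolerance must be non-negative");
  return std::fabs(actual - expected) <= tolerance;
}
```

For `1.0f` and `1.0f + std::numeric_limits<float>::epsilon()`, `formatDefault` returns `"1"` for both, which hides the mismatch. `formatFullPrecision` returns different strings for the two values, and `almostEqual` accepts the pair under a small tolerance such as `1e-5f`.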

bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from 06542f5 to aec9f1b Compare November 4, 2024 18:55

bjacob changed the base branch from users/bjacob/cdna3-tests to main November 4, 2024 18:56

bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from aec9f1b to 9351397 Compare November 4, 2024 19:22

bjacob marked this pull request as ready for review November 4, 2024 21:18

bjacob requested review from antiagainst and qedawkins as code owners November 4, 2024 21:18

bjacob requested a review from Groverkss November 4, 2024 21:19

bjacob mentioned this pull request Nov 4, 2024

e2e matmul test improvements #19016

Merged

bjacob requested a review from kuhar November 4, 2024 21:31

Groverkss approved these changes Nov 7, 2024

bjacob added 2 commits November 7, 2024 13:50

rdna3-dt

72b39e4

Signed-off-by: Benoit Jacob <[email protected]>

review comments

04bc2d2

Signed-off-by: Benoit Jacob <[email protected]>

bjacob force-pushed the users/bjacob/rdna3-data-tiling branch from ce464c9 to 04bc2d2 Compare November 7, 2024 21:11

bjacob merged commit 8e5f218 into main Nov 7, 2024
39 checks passed

bjacob deleted the users/bjacob/rdna3-data-tiling branch November 7, 2024 21:40
