-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU data tiling on RDNA3 #18980
GPU data tiling on RDNA3 #18980
Conversation
06542f5
to
aec9f1b
Compare
aec9f1b
to
9351397
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool insights for the RDNA3 intrinsic. Thank you for doing this! I mainly have documentation comments, since this might not be obvious to someone who does not understand the intrinsic data reuse on threads very well.
compiler/src/iree/compiler/Codegen/Common/GPU/test/CMakeLists.txt
Outdated
Show resolved
Hide resolved
// only to the layout dims, and do not reflect a possible additional | ||
// thread-distribution-only dimension present on some architectures (RDNA3). | ||
// When such an extra dim exists, multiple threads are reading the same data. | ||
// So we need to distinguish layoutThreadSizes vs. distributionThreadSizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think you could put an example of how this looks like for an intrinsic? This might not be obvious to someone who hasn't stared at RDNA3 intrinsics enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just expanded the general explanation, because I still felt that this was something that we are doing from first principles and can explain a such. I felt that to really work out an example would take more space than we could spend here.
// Most of the rest of this function is the computation of the offsets. | ||
// The basic idea is to delinearize the threadId over the basis of | ||
// cross-thread dimensions. The main subtlety is that the TileSwizzles refer | ||
// only to the layout dims, and do not reflect a possible additional | ||
// thread-distribution-only dimension present on some architectures (RDNA3). | ||
// When such an extra dim exists, multiple threads are reading the same data. | ||
// So we need to distinguish layoutThreadSizes vs. distributionThreadSizes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we have an intrinsic that takes let's say 32 elements, we are essentially broadcasting it to size 64, and distributing it. And when doing anything else, we ignore the broadcasted dimension, to get the same data on different threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
// Erase the delinearized index that corresponds to the extra distribution | ||
// dimension that we had inserted above. | ||
tileOffsets.erase(tileOffsets.begin() + distributionOnlyDimIdx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mention that this essentially leads to threads reading the same data, accomplishing what we wanted for the intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Benoit Jacob <[email protected]>
Signed-off-by: Benoit Jacob <[email protected]>
ce464c9
to
04bc2d2
Compare
Working on #18980 let me spend quality time with e2e matmul tests and suggested some changes. The main change is to simplify the printing of numerical values to always use high precision, meaning print all significant digits of floating point values. Since our tests generate small integral values and the intent is generally to be testing mostly the exact arithmetic that happens on small integral values, in most cases this doesn't make any difference. But I found that RDNA3 float arithmetic produces non-exact results even on those values. As a result, I got values like 1+epsilon where 1 was expected, causing a test to fail (since we didn't know we needed to opt out from requiring exact results) and the test output cryptically printed both values as "1". The other change is to more consistently print the same number of rows and columns regardless of whether we are at the start or in the middle of a dimension, and to have that number be what we call "context" (before, it was "2 * context"). Also a seasonal emoji change. Signed-off-by: Benoit Jacob <[email protected]>
A few things were needed: * Populate required fields in KnownTargets.cpp. * Support the case where the intrinsic vector operand size is greater than the load instruction size (here it is 16xf16 = 256 bit). * `buildMmaOperation` creates `vector.insert_strided_slice` to insert the new accumulator vectors into the accumulator tile. In doing so, it was relying on `vector.insert_strided_slice` implicit expand-shape semantics, in ways that worked for the shapes we had seen in CDNA3 but not here. Solved by explicitly expanding shapes with `vector.shape_cast` ops. * In thread-distribution code (populateOperandXxx), we needed to account for the nuance between two distinct thread grids: "layout" vs "distribution". In the case of RDNA3, there is a distribution-only dimension that isn't reflected in the layout-centric TileSwizzle's. * On RDNA3, some float arithmetic is strongly non-IEEE754-compliant: even exactly-representable small integral values, on which float arithmetic should be exact, have epsilon numerical errors! Addressed by tolerance. * Fix a bug: the doubly-nullable type `std::optional<IREE::GPU::DataTiledMMAAttr>` tricked us, change to `IREE::GPU::DataTiledMMAAttr`. --------- Signed-off-by: Benoit Jacob <[email protected]>
Working on iree-org#18980 let me spend quality time with e2e matmul tests and suggested some changes. The main change is to simplify the printing of numerical values to always use high precision, meaning print all significant digits of floating point values. Since our tests generate small integral values and the intent is generally to be testing mostly the exact arithmetic that happens on small integral values, in most cases this doesn't make any difference. But I found that RDNA3 float arithmetic produces non-exact results even on those values. As a result, I got values like 1+epsilon where 1 was expected, causing a test to fail (since we didn't know we needed to opt out from requiring exact results) and the test output cryptically printed both values as "1". The other change is to more consistently print the same number of rows and columns regardless of whether we are at the start or in the middle of a dimension, and to have that number be what we call "context" (before, it was "2 * context"). Also a seasonal emoji change. Signed-off-by: Benoit Jacob <[email protected]>
A few things were needed: * Populate required fields in KnownTargets.cpp. * Support the case where the intrinsic vector operand size is greater than the load instruction size (here it is 16xf16 = 256 bit). * `buildMmaOperation` creates `vector.insert_strided_slice` to insert the new accumulator vectors into the accumulator tile. In doing so, it was relying on `vector.insert_strided_slice` implicit expand-shape semantics, in ways that worked for the shapes we had seen in CDNA3 but not here. Solved by explicitly expanding shapes with `vector.shape_cast` ops. * In thread-distribution code (populateOperandXxx), we needed to account for the nuance between two distinct thread grids: "layout" vs "distribution". In the case of RDNA3, there is a distribution-only dimension that isn't reflected in the layout-centric TileSwizzle's. * On RDNA3, some float arithmetic is strongly non-IEEE754-compliant: even exactly-representable small integral values, on which float arithmetic should be exact, have epsilon numerical errors! Addressed by tolerance. * Fix a bug: the doubly-nullable type `std::optional<IREE::GPU::DataTiledMMAAttr>` tricked us, change to `IREE::GPU::DataTiledMMAAttr`. --------- Signed-off-by: Benoit Jacob <[email protected]> Signed-off-by: Giacomo Serafini <[email protected]>
Working on iree-org#18980 let me spend quality time with e2e matmul tests and suggested some changes. The main change is to simplify the printing of numerical values to always use high precision, meaning print all significant digits of floating point values. Since our tests generate small integral values and the intent is generally to be testing mostly the exact arithmetic that happens on small integral values, in most cases this doesn't make any difference. But I found that RDNA3 float arithmetic produces non-exact results even on those values. As a result, I got values like 1+epsilon where 1 was expected, causing a test to fail (since we didn't know we needed to opt out from requiring exact results) and the test output cryptically printed both values as "1". The other change is to more consistently print the same number of rows and columns regardless of whether we are at the start or in the middle of a dimension, and to have that number be what we call "context" (before, it was "2 * context"). Also a seasonal emoji change. Signed-off-by: Benoit Jacob <[email protected]> Signed-off-by: Giacomo Serafini <[email protected]>
A few things were needed:
buildMmaOperation
createsvector.insert_strided_slice
to insert the new accumulator vectors into the accumulator tile. In doing so, it was relying onvector.insert_strided_slice
implicit expand-shape semantics, in ways that worked for the shapes we had seen in CDNA3 but not here. Solved by explicitly expanding shapes withvector.shape_cast
ops.std::optional<IREE::GPU::DataTiledMMAAttr>
tricked us, change toIREE::GPU::DataTiledMMAAttr
.