[GPU] Add support for materializing contraction ops. #18196
Conversation
compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp
Overall, looks good, but we need to account for permutations of the inner tile in the multi_mma op. Take a look at my comment.
compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp
auto mmaOp = rewriter.create<IREE::GPU::MultiMmaOp>(
    loc, operands[0], operands[1], operands[2],
    ArrayRef<AffineMap>{lhsMap, rhsMap, accMap},
    iteratorTypes, kind.value());
The materialization patterns expect a column-major RHS inner tile layout, right? If that is the case, then I think the MultiMmaOp needs to have an rhs_permutation. We also need to account for permutations introduced by data tiling for each operand. The semantics of the op are that with no permutations, each input is assumed to have a row-major inner tile layout.
You can see in this test:
iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/test/iree_gpu_ops.mlir, lines 190 to 209 in 3901e62:
// -----
#contraction_accesses = [
  affine_map<(i, j, k) -> (i, k)>,
  affine_map<(i, j, k) -> (k, j)>,
  affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
    %lhs: tensor<?x?x32x8xf16>,
    %rhs: tensor<?x?x32x8xf16>,
    %acc: tensor<?x?x32x32xf32>) -> tensor<?x?x32x32xf32> {
  %0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
    indexing_maps = #contraction_accesses,
    iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
    kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
    rhs_permutation = array<i64: 1, 0>
  } : tensor<?x?x32x8xf16>, tensor<?x?x32x8xf16> into tensor<?x?x32x32xf32>
  return %0 : tensor<?x?x32x32xf32>
}
The op has rhs_permutation = array<i64: 1, 0> when the rhs inner tile (32x8) is column-major (MxK). In the case of data tiling, the inner tile will also be expanded and permuted with some general tile swizzle. It may look something like this after the expand_shape:
#contraction_accesses = [
affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
%lhs: tensor<?x?x32x2x4xf16>,
%rhs: tensor<?x?x32x2x4xf16>,
%acc: tensor<?x?x4x2x4x32xf32>) -> tensor<?x?x4x2x4x32xf32> {
%0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
indexing_maps = #contraction_accesses,
iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
rhs_permutation = array<i64: 2, 0, 1>
} : tensor<?x?x32x2x4xf16>, tensor<?x?x32x2x4xf16> into tensor<?x?x4x2x4x32xf32>
return %0 : tensor<?x?x4x2x4x32xf32>
}
Now, rhs_permutation = array<i64: 2, 0, 1>, because the canonical layout after the expand_shape would be 2x4x32, but it is 32x2x4. If there is an additional transpose due to data tiling, then that transpose would also need to be composed with rhs_permutation here.
The materialization pattern should consider the transposes that happened on the lhs, rhs, and acc, and express them accordingly in the MultiMmaOp.
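As a minimal standalone sketch of what composing those transposes with rhs_permutation could look like (plain C++ with hypothetical helpers, not the actual IREE/MLIR utilities, using the result[i] = src[perm[i]] convention from the example above):

// Hypothetical standalone example; perm semantics: result[i] = src[perm[i]],
// matching rhs_permutation = array<i64: 2, 0, 1> turning 2x4x32 into 32x2x4.
#include <cassert>
#include <cstdint>
#include <vector>

static std::vector<int64_t> applyPerm(const std::vector<int64_t> &src,
                                      const std::vector<int64_t> &perm) {
  std::vector<int64_t> result;
  for (int64_t p : perm)
    result.push_back(src[p]);
  return result;
}

// Composes `first` followed by `second` into a single permutation.
static std::vector<int64_t> composePerms(const std::vector<int64_t> &first,
                                         const std::vector<int64_t> &second) {
  std::vector<int64_t> composed;
  for (int64_t s : second)
    composed.push_back(first[s]);
  return composed;
}

int main() {
  std::vector<int64_t> canonical = {2, 4, 32}; // canonical inner tile
  std::vector<int64_t> rhsPerm = {2, 0, 1};    // from the example above
  assert((applyPerm(canonical, rhsPerm) == std::vector<int64_t>{32, 2, 4}));
  // A further (hypothetical) transpose introduced by data tiling is folded in
  // by composing it with the existing permutation.
  std::vector<int64_t> extraTranspose = {0, 2, 1};
  std::vector<int64_t> folded = composePerms(rhsPerm, extraTranspose);
  assert((applyPerm(canonical, folded) ==
          applyPerm(applyPerm(canonical, rhsPerm), extraTranspose)));
  return 0;
}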
Side note: This is sort of contradictory to the assumption we had been using in our discussions that the inner tile dimensions are effectively opaque, and I believe it is part of the reason why we may need to extend the semantics (or mma enums) of the multi_mma op in order to do interleaving. Right now, the inner tile dims (of a single intrinsic tile) are separated from the outer tiles, but in data tiling we want to interleave them.
@qedawkins can you confirm that I am getting all the semantics correct?
Interesting, and thanks for pointing it out. I think you're talking about the C matrix, not the B matrix, in the example? I agree that we need to attach those permutations because we need a way to express different inner layouts. So yes, here we want to add {1, 0} to LHS, RHS, and ACC.
I can't really interpret the case of rhs_permutation = array<i64: 2, 0, 1>, because the permutations are defined for the MNK mapping, whereas here the RHS rank is 3. I feel that something is missing, and we might need to extend the semantics in this case.
iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.td, lines 115 to 118 in 868f41e:
This operation can represent an intrinsic both in subgroup/warp and
distributed (thread) abstractions through the intrinsic attribute interface.
It does so semi-opaquely by including optional permutations of each MMA
fragment with respect to the "canonical" MNK row major matrix multiply.
After thinking a lot, I think I made a mistake in the shapes. It should be *x4x16 for the LHS after materialization, not *x16x4, because we already relayout the inner tiles in a column-major fashion. I'll rebase on #18135 and reuse the helpers.
// TODO(hanchung): Support batch gemms.
Location loc = op.getLoc();
SmallVector<int64_t> identityPerm = {0, 1};
nit: remove unused variable.
Force-pushed 2b1310b to 2ee5060
Force-pushed 0091ebd to ef97f30
Rebased. I will address Max's review comments and fix what I identified in the comment soon. Converting to draft for now.
Force-pushed 2db1402 to effd193
// The current multi_mma op is not able to represent the data-tiled mma
// layouts. There are internal details about the layout after we data-tile a
// matrix. The matrix is in row-major fashion for the outer dimensions. The
// inner dimensions were in row-major order, and then we swizzle them to be
// column-major. The number of inner dimensions is greater than 2 after tile
// swizzling. Here we use the "row-major" layout for the inner dimensions. The
// shape is exactly innerTile[0]xinnerTile[1], where innerTile can be
// derived from the encoding info. The chunks of the inner tiles are in
// column-major fashion, but it does not really matter, because they are
// opaque. I don't find a way to represent it, so I added a
// `__is_data_tiled__` magic attribute to the op. The expectation is that
// there is a pass to resolve the `__is_data_tiled__` attribute before
// additional lowering. It should unroll the op and resolve the layout; each
// multi_mma op maps to a single intrinsic at subgroup level. Then we do the
// lowering that we've been doing in the TileAndFuse pipeline.
//
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1]; then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
//   kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
//   lhs_permutation = array<i64: 0, 1>,
//   rhs_permutation = array<i64: 0, 1>,
//   __is_data_tiled__
// } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
//   into tensor<?x?x32x64xf32>
//
// The data-tiled chunk is represented in row-major fashion, so the
// permutation is identity; the `__is_data_tiled__` attribute is attached.
//
// Then we resolve the `__is_data_tiled__` layout, which does the unrolling.
// It becomes a sequence of
//
// iree_gpu.multi_mma %lhs_i, %rhs_i, %acc_i {
//   kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
//   lhs_permutation = array<i64: 1, 0>,
//   rhs_permutation = array<i64: 1, 0>,
// } : tensor<?x?x4x16xf32>, tensor<?x?x4x16xf32>
//   into tensor<?x?x16x16xf32>
//
// This is what we usually have in the TileAndFuse pipeline. IMO, this step
// happens after we distribute the op to subgroups/threads, and before we
// distribute them to lanes.
@Max191 @qedawkins I took a stab at writing down one of the potential paths for the data-tiled multi_mma op. The idea is documented in the comment, and I will try to prototype it. Let me know if you see something weird and we can discuss it next week.
// Here we use the "row-major" layout for inner dimensions. The
// shape is exactly innerTile[0]xinnerTile[1] where innerTile can be
// derived from the encoding info. The chunk of the inner tiles are in
// column-major fashion, but it does not really matter.
@hanhanW Could you explain this part?
The shapes here don't match what I had in my head:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 0, 1>,
// rhs_permutation = array<i64: 0, 1>,
// __is_data_tiled__
// } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
// into tensor<?x?x32x64xf32>
My understanding was that the interleaved dimension will always be collapsed into the innermost dimension. So, in this example, with [2, 4, 1] unrolling, the shapes would look like:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 1, 0>,
// rhs_permutation = array<i64: 1, 0>,
// __is_data_tiled__
// } : tensor<?x?x4x32xf32>, tensor<?x?x16x16xf32>
// into tensor<?x?x32x64xf32>
Which becomes a sequence of 8 single tiles:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 1, 0>,
// rhs_permutation = array<i64: 1, 0>,
// __is_data_tiled__
// } : tensor<?x?x4x16xf32>, tensor<?x?x16x4xf32>
// into tensor<?x?x16x16xf32>
As I think about it more, I think we actually only need the __is_data_tiled__ to be able to infer the interleaved unrolling factors. Unrolling of M and N can be determined from the output shape, and the interleaved unrolling factor can be determined by the size of the inner dimensions on the LHS and RHS.
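As a quick sanity check of that inference on the example shapes above (a standalone sketch with made-up variable names, not IREE APIs):

#include <cassert>
#include <cstdint>

int main() {
  // Intrinsic MFMA_F32_16x16x4_F32.
  const int64_t m = 16, n = 16, k = 4;
  // Inner dims from the example: acc is 32x64, lhs inner tile holds 4x32 elements.
  const int64_t accInnerM = 32, accInnerN = 64;
  const int64_t lhsInnerElementCount = 4 * 32;
  // M and N unrolling come straight from the output inner shape.
  const int64_t mUnroll = accInnerM / m; // 2
  const int64_t nUnroll = accInnerN / n; // 4
  // The interleaved K unrolling is what remains of the LHS inner element count.
  const int64_t kUnroll = lhsInnerElementCount / (m * k) / mUnroll; // 1
  assert(mUnroll == 2 && nUnroll == 4 && kUnroll == 1);
  return 0;
}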
I talked to Max offline and we brainstormed what is missing and what needs to be done. I was over-anchoring on compatibility with CPU and abusing the opaque types in the multi_mma op, which makes the inner tile shapes tricky. We can just add a "resolveEncodingShape" method to the type converter or whatever, and produce the shape in those binding ops. Then we don't need additional reshapes to make the flow/hal binding ops happy.
The issue is how to represent the materialized contraction ops. There is no existing op that can express them. The closest one is the multi_mma op, but its definition does not support n-D inner tiles; basically, the *_permutation attributes are broken in this case. The op we're looking for is something similar to multi_mma without those permutation attributes. We think that we can still attach __is_data_tiled__ to the op and say that it has different semantics on inner tiles. The op is very useful because it already implements TilingInterface, and we can use that to do workgroup-level distribution. My plan is to use it in the prototype, and we can branch off a new op when it becomes tricky.
I am strongly -1 to another attribute on the multi_mma op. I would just add a new kind of attribute, maybe #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>>, that captures what you need. Also, the permutation fields simply describe the permutation of the input operands relative to the "canonical" layout. The meaning of those permutation fields depends on the mma_attr kind on the multi_mma op.
The permutations are broken if we expand the inner tiles; they are no longer a single M/N/K dimension. The #iree_gpu.data_tiled_layout<*> looks better to me, but note that the permutation attribute would be broken. Or we can note that the permutations are not meaningful when the kind is data_tiled_layout. (My concern is that the op semantics become more complicated, but I agree that it is the easiest way to make it work.)
The permutation attribute is not restricted to 2-D; in fact, it gets expanded to >2D during the course of normal lowering without data tiling already, so expanding the inner tiles during MaterializeEncodings should be fine. AFAICR I didn't write any patterns w/ the multi-mma op that look at the dimensionality or shape of the inner dims; all of that logic is part of the interface implementations of #iree_gpu.mma_layout. Hence, if we need different behavior for data tiling, then a new kind of mma_attr like iree_gpu.data_tiled_layout<...> is the way to go. Then the permutation fields are only there if they are useful, which perhaps they won't be for data tiling.
okay, cool! I'll try to add the attribute and expand the shapes during materialization.
#18294 is ready for review. I'll rebase the PR and add the new attribute.
Force-pushed 6429fdc to 15b2603
Force-pushed 15b2603 to 72b75eb
int64_t expectedNumLhsElem = m * k;
int64_t expectedNumRhsElem = n * k;
int64_t expectedNumAccElem = m * n;

// The below variables have zero values when the op is at thread level.
int64_t M0K0 = lhsInnerElementCount / m / k; // (M0M1K0K1) / M1 / K1 = M0K0
int64_t N0K0 = rhsInnerElementCount / n / k; // (N0N1K0K1) / N1 / K1 = N0K0
int64_t M0N0 = accInnerElementCount / m / n; // (M0M1N0N1) / M1 / N1 = M0N0
if (hasPureTensorSemantics() && M0K0 && N0K0 && M0N0) {
  int mUnrollFactor = std::sqrt(M0K0 * M0N0 / N0K0);
  int nUnrollFactor = std::sqrt(M0N0 * N0K0 / M0K0);
  int kUnrollFactor = std::sqrt(M0K0 * N0K0 / M0N0);
  expectedNumLhsElem *= mUnrollFactor * kUnrollFactor;
  expectedNumRhsElem *= nUnrollFactor * kUnrollFactor;
  expectedNumAccElem *= mUnrollFactor * nUnrollFactor;
}
@qedawkins I ended up making getMNKShape return the intrinsic size, because the attribute itself does not carry shapes, so it really can't infer the MNK shape. Instead, this is where the unrolling factor inference happens. I figured out the formula for calculating the unrolling factors and documented it in the comments.
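For reference, the formulas check out algebraically, writing M0, N0, K0 for the unroll factors and M1, N1, K1 for the intrinsic sizes (notation not from the code, just for this check):

\[
M_0K_0 = \frac{\text{lhsInnerElementCount}}{M_1K_1},\qquad
N_0K_0 = \frac{\text{rhsInnerElementCount}}{N_1K_1},\qquad
M_0N_0 = \frac{\text{accInnerElementCount}}{M_1N_1}
\]
\[
\sqrt{\frac{(M_0K_0)(M_0N_0)}{N_0K_0}} = \sqrt{M_0^2} = M_0,\qquad
\sqrt{\frac{(M_0N_0)(N_0K_0)}{M_0K_0}} = N_0,\qquad
\sqrt{\frac{(M_0K_0)(N_0K_0)}{M_0N_0}} = K_0
\]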
The missing interface method implementation is populateOperandOffsetsSizesStrides. For the methods that we don't care about in data tiling, I just put an assertion to signal the failure.
Please take a look. We can discuss some details tomorrow. If you want to look at GPU ops/attributes changes, you can take a look at this commit: 72b75eb#diff-334a4fcfe53c3ad70f495b1a0b7e8d5f5f4e4181e04fd359193cbc3097cea68fR206-R221
(cc @bjacob @pashu123 @qedawkins @lialan )
I forgot one aspect when we chatted yesterday, and that is that getMNKShape is also used by hasThreadSemantics, which is required by the thread distribution pattern. Rather than allowing these unrolling factors on the inner dims (which have no meaning for any non-data-tiled mma attr), I would just add a FailureOr<bool> innerShapeHasThreadSemantics interface method to MmaInterfaceAttr with a default implementation that checks the MNK shape like the current verifier does (and fails if the shape is invalid). We could use this to get stronger verification of the non-data-tiled path too.
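A rough standalone sketch of what such a default check might boil down to (hypothetical names and signature, not the actual MmaInterfaceAttr API; the per-thread counts stand in for the element counts of the per-lane vector types):

#include <cstdint>
#include <optional>

struct IntrinsicSizes {
  int64_t m, n, k;                                  // single-intrinsic MNK
  int64_t lhsPerThread, rhsPerThread, accPerThread; // per-lane element counts
};

// Returns true for thread semantics, false for subgroup semantics, and
// std::nullopt when the inner element counts match neither (invalid shape).
static std::optional<bool>
innerShapeHasThreadSemantics(const IntrinsicSizes &sizes, int64_t lhsElems,
                             int64_t rhsElems, int64_t accElems) {
  if (lhsElems == sizes.m * sizes.k && rhsElems == sizes.n * sizes.k &&
      accElems == sizes.m * sizes.n)
    return false;
  if (lhsElems == sizes.lhsPerThread && rhsElems == sizes.rhsPerThread &&
      accElems == sizes.accPerThread)
    return true;
  return std::nullopt;
}

int main() {
  IntrinsicSizes sizes{16, 16, 4, /*lhsPerThread=*/1, /*rhsPerThread=*/1,
                       /*accPerThread=*/4}; // illustrative values only
  // 16x4, 16x4, and 16x16 inner element counts -> subgroup semantics.
  return innerShapeHasThreadSemantics(sizes, 16 * 4, 16 * 4, 16 * 16) ==
                 std::optional<bool>(false)
             ? 0
             : 1;
}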
AttrBuilder<(ins "MMAIntrinsic":$intrinsic)>
AttrBuilder<(ins
  "MMAIntrinsic":$intrinsic,
  CArg<"bool", "false">:$isDataTiled
I am not yet convinced that isDataTiled is sufficient with a single boolean. I would much rather this just be a separate Attribute class that implements the MmaInterfaceAttr. This should make getOperandOffsetsSizesAndStrides much cleaner to implement. We can always make #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_...>> take one of these attributes as a sub-attribute, allowing us to query all of the same interface methods + extra class declarations from the underlying mma attr. That will make it much easier to evolve data tiling implementations without mixing them with the default paths.
The isDataTiled boolean is functionally the same as adding an iree_gpu.data_tiled_layout wrapper class for this early prototype. Right now, the goal is to quickly lay down the base for GPU materialization for just the F32 intrinsic. Once that is working, there will be time to refactor the codegen details. Since this has been blocking progress for a while, and it is only landing to an experimental branch, I think it is fine to do for now. I agree this should not land like this to main, but when things are fleshed out and ready to land, it can be refactored.
This looks good to me now. Ultimately, I agree with @qedawkins that we should add a new class for data tiled mma intrinsic types, but for now I am okay with this so we can continue making progress.
This reverts commit 72b75eb.
Force-pushed ddb024b to ca0f273
let cppNamespace = "::mlir::iree_compiler::IREE::GPU";

let description = [{
  An mma instruction for data tiling :)
ah, I need to update this. :)
std::tuple<VectorType, VectorType, VectorType>
DataTiledMMAAttr::getABCVectorTypes() const {
This is required by the interface, and I just duplicated the code from MMAAttr. We can perhaps just return a failure (i.e., {{}, {}, {}}) for now.
just returning failure SGTM too, up to you for this first version.
Nice! Quick skim LGTM (need to leave for a bit) but looks good for the branch. We'll want to build out the implementation a bit more before landing on main I think.
int64_t expectedNumAccElem = m * n;

if (auto dataTiledMmaAttr =
        dyn_cast<IREE::GPU::DataTiledMMAAttr>(getKind())) {
It would be nice to move this out of the MultiMmaOp when we go to land it in main but this is good for now.
Merged 6bc3c8b into iree-org:shared/gpu-data-tiling-materialize-encoding
To serve the needs of data-tiled contraction ops, we can't simply use the regular mma intrinsics because they implement more interface methods. The multi_mma op with a data-tiled mma intrinsic should be much simpler because we already relayout (i.e., pack + tile swizzle) the matrix. Thus, the revision introduces IREEGPU_DataTiledMMAAttr, which gives hints to the multi_mma op.

The new attribute implements minimal interface methods. Only getABCElementTypes, getMNKShape, and getSubgroupSize are implemented. They are all trivial methods which provide intrinsic details like the amd_matrix_instruction_calculator does. The implementation detail of getABCVectorTypes is not clear at this moment. We can potentially implement it based on the needs of distributing the data-tiled multi_mma op to lanes. The other missing implementation is populateOperandOffsetsSizesStrides. Both methods will be used in the later lowering, and we need to implement them to connect the e2e pieces.

Different from the regular MMA attribute, this attribute has a data-tiling concept which can model unrolling and interleaving layouts. In the concept of data tiling, we always unroll the parallel dimensions (i.e., the M and N dimensions) to be outermost, and interleave the unrolled K dimension; i.e., the unrolled K dimension becomes the innermost dimension. The constraint can be relaxed based on data-tiling needs, and the additional information can be added to parameters.

With the new DataTiledMMAAttr, the revision teaches the multi_mma op to take the unrolling factors into account in the verifier.

The revision sorts out the ambiguities of materialized ops in data tiling and switches MMAAttr to DataTiledMMAAttr in the materialization patterns.

Note: this is not merging into the main branch. This is mainly a prototype, and we will contribute it to main later on.