[GPU] Add support for materializing contraction ops. #18196


@hanhanW hanhanW commented Aug 13, 2024

To serve the needs of data-tiled contraction ops, we can't simply use the regular MMA intrinsics, because they implement more interface methods. A multi_mma op with a data-tiled MMA intrinsic should be much simpler, because we have already relaid out (i.e., packed and tile-swizzled) the matrices. Thus, this revision introduces IREEGPU_DataTiledMMAAttr, which provides the hints to the multi_mma op.

The new attribute implements a minimal set of interface methods; only getABCElementTypes, getMNKShape, and getSubgroupSize are implemented. They are all trivial methods that provide intrinsic details, similar to what amd_matrix_instruction_calculator provides. The implementation details of getABCVectorTypes are not clear at this moment; we can potentially implement it based on the needs of distributing the data-tiled multi_mma op to lanes. The other missing implementation is populateOperandOffsetsSizesStrides. Both methods will be used in later lowering, and we need to implement them to connect the e2e pieces.

Unlike the regular MMA attribute, this attribute carries data-tiling concepts, which can model unrolling and interleaving layouts. In data tiling, we always unroll the parallel dimensions (i.e., the M and N dimensions) to be outermost and interleave the unrolled K dimension, i.e., the unrolled K dimension becomes the innermost dimension. The constraint can be relaxed based on data-tiling needs, and the additional information can be added as parameters.

With the new DataTiledMMAAttr, the revision teaches the multi_mma op to take unrolling factors into account in its verifier.

The revision sorts out the ambiguities of materialized ops in data tiling and switches MMAAttr to DataTiledMMAAttr in the materialization patterns.

Note: this is not merging into the main branch. It is mainly a prototype, and we will contribute it to main later on.

Contributor

@Max191 Max191 left a comment

Overall, looks good, but we need to account for permutations of the inner tile in the multi_mma op. Take a look at my comment.

Comment on lines 295 to 298
auto mmaOp = rewriter.create<IREE::GPU::MultiMmaOp>(
loc, operands[0], operands[1], operands[2],
ArrayRef<AffineMap>{lhsMap, rhsMap, accMap},
iteratorTypes, kind.value());
Contributor

The materialization patterns expect a column-major RHS inner tile layout, right? If that is the case, then I think the MultiMmaOp needs to have an rhs_permutation. We also need to account for permutations introduced by data tiling for each operand. The semantics of the op are that with no permutations, each input is assumed to have a row-major inner tile layout.

You can see in this test:

// -----
#contraction_accesses = [
affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
%lhs: tensor<?x?x32x8xf16>,
%rhs: tensor<?x?x32x8xf16>,
%acc: tensor<?x?x32x32xf32>) -> tensor<?x?x32x32xf32> {
%0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
indexing_maps = #contraction_accesses,
iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
rhs_permutation = array<i64: 1, 0>
} : tensor<?x?x32x8xf16>, tensor<?x?x32x8xf16> into tensor<?x?x32x32xf32>
return %0 : tensor<?x?x32x32xf32>
}

The op has rhs_permutation = array<i64: 1, 0> when the rhs inner tile (32x8) is column-major (MxK). In the case of data tiling, the inner tile will also be expanded and permuted with some general tile swizzle. It may look something like this after the expand_shape:

#contraction_accesses = [
 affine_map<(i, j, k) -> (i, k)>,
 affine_map<(i, j, k) -> (k, j)>,
 affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
  %lhs: tensor<?x?x32x2x4xf16>,
  %rhs: tensor<?x?x32x2x4xf16>,
  %acc: tensor<?x?x4x2x4x32xf32>) -> tensor<?x?x4x2x4x32xf32> {
  %0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
    indexing_maps = #contraction_accesses,
    iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
    kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
    rhs_permutation = array<i64: 2, 0, 1>
  } : tensor<?x?x32x2x4xf16>, tensor<?x?x32x2x4xf16> into tensor<?x?x4x2x4x32xf32>
  return %0 : tensor<?x?x4x2x4x32xf32>
}

Now, rhs_permutation = array<i64: 2, 0, 1>, because the canonical layout after the expand_shape would be 2x4x32, but it is 32x2x4. If there is an additional transpose due to data tiling, then that transpose would also need to be composed with rhs_permutation here.

The materialization pattern should consider the transposes that happened on the lhs, rhs, and acc, and express them accordingly in the MultiMmaOp.
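To make the composition concrete, here is a minimal standalone sketch; applyPermutation and composePermutations are hypothetical helpers written only for illustration (not IREE utilities), using the convention that perm[i] names the canonical dimension that result dimension i reads from:

#include <cassert>
#include <cstdint>
#include <vector>

// result[i] = source[perm[i]].
std::vector<int64_t> applyPermutation(const std::vector<int64_t> &source,
                                      const std::vector<int64_t> &perm) {
  std::vector<int64_t> result;
  for (int64_t p : perm)
    result.push_back(source[p]);
  return result;
}

// Compose a base permutation (operand layout relative to the canonical
// layout) with an extra transpose applied on top of the permuted tile:
// final[i] = canonical[base[extra[i]]].
std::vector<int64_t> composePermutations(const std::vector<int64_t> &base,
                                         const std::vector<int64_t> &extra) {
  return applyPermutation(base, extra);
}

int main() {
  // The canonical RHS inner tile after the expand_shape above is 2x4x32; the
  // observed 32x2x4 layout corresponds to rhs_permutation = [2, 0, 1].
  std::vector<int64_t> canonical = {2, 4, 32};
  std::vector<int64_t> rhsPerm = {2, 0, 1};
  assert(applyPermutation(canonical, rhsPerm) ==
         (std::vector<int64_t>{32, 2, 4}));

  // A hypothetical extra data-tiling transpose swapping the last two dims of
  // the 32x2x4 tile composes into a single permutation [2, 1, 0], i.e. 32x4x2.
  std::vector<int64_t> composed = composePermutations(rhsPerm, {0, 2, 1});
  assert(composed == (std::vector<int64_t>{2, 1, 0}));
  assert(applyPermutation(canonical, composed) ==
         (std::vector<int64_t>{32, 4, 2}));
  return 0;
}

The same composition applies to the LHS and ACC permutations.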

Side note: This is sort of contradictory to the assumption we had been using in our discussions that the inner tile dimensions are effectively opaque, and I believe it is part of the reason why we may need to extend the semantics (or mma enums) of the multi_mma op in order to do interleaving. Right now, the inner tile dims (of a single intrinsic tile) are separated from the outer tiles, but in data tiling we want to interleave them.

@qedawkins can you confirm that I am getting all the semantics correct?

Contributor Author

Interesting, and thanks for pointing it out. I think you're talking about the C matrix, not the B matrix, in the example? I agree that we need to attach those permutations because we need a way to express different inner layouts. So yes, here we want to add {1, 0} to the LHS, RHS, and ACC.

I can't really interpret the case of rhs_permutation = array<i64: 2, 0, 1>, because the permutations are defined for the MNK mapping, while here the rank of the RHS inner tile is 3. I feel that something is missing, and we might need to extend the semantics in this case.

This operation can represent an intrinsic both in subgroup/warp and
distributed (thread) abstractions through the intrinsic attribute interface.
It does so semi-opaquely by including optional permutations of each MMA
fragment with respect to the "canonical" MNK row major matrix multiply.

After thinking about it a lot, I think I made a mistake in the shapes. It should be *x4x16 for the LHS after materialization, not *x16x4, because we already relayout the inner tiles in a column-major fashion. I'll rebase on #18135 and reuse the helpers.


// TODO(hanchung): Support batch gemms.
Location loc = op.getLoc();
SmallVector<int64_t> identityPerm = {0, 1};
Contributor

nit: remove unused variable.

@hanhanW hanhanW force-pushed the shared/gpu-data-tiling-materialize-encoding branch from 2b1310b to 2ee5060 on August 16, 2024 18:35
@hanhanW hanhanW force-pushed the gpu-contraction-to-multi-mma branch from 0091ebd to ef97f30 on August 16, 2024 18:59
Contributor Author

hanhanW commented Aug 16, 2024

Rebased. I will address Max's review comments and fix what I identified in my comment soon. Converting to draft for now.

@hanhanW hanhanW marked this pull request as draft August 16, 2024 19:00
@hanhanW hanhanW force-pushed the gpu-contraction-to-multi-mma branch 5 times, most recently from 2db1402 to effd193 on August 16, 2024 22:33
Comment on lines 413 to 455
// The current multi_mma op is not able to represent the data-tiled mma
// layouts. There are internal details about the layout after we data-tile a
// matrix. The matrix is in row-major fashion for the outer dimensions. The
// inner dimensions were in row-major order, and then we swizzle them to be
// column-major. The number of inner dimensions is greater than 2 after tile
// swizzling. Here we use the "row-major" layout for inner dimensions. The
// shape is exactly innerTile[0]xinnerTile[1] where innerTile can be
// derived from the encoding info. The chunks of the inner tiles are in
// column-major fashion, but it does not really matter because they are
// opaque. I can't find a way to represent this, so I added a
// `__is_data_tiled__` magic attribute to the op. The expectation is that
// there is a pass to resolve the `__is_data_tiled__` attribute before
// additional lowering. It should unroll the op and resolve the layout; each
// multi_mma op maps to a single intrinsic at subgroup level. Then we do the
// lowering that we've been doing in the TileAndFuse pipeline.
//
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 0, 1>,
// rhs_permutation = array<i64: 0, 1>,
// __is_data_tiled__
// } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
// into tensor<?x?x32x64xf32>
//
// The data-tiled chunk is represented in row-major fashion, so the permutation
// is the identity; the `__is_data_tiled__` attribute is attached.
//
// Then we resolve the `__is_data_tiled__` layout, which does the unrolling.
// It becomes a sequence of
//
// iree_gpu.multi_mma %lhs_i, %rhs_i, %acc_i {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 1, 0>,
// rhs_permutation = array<i64: 1, 0>,
// } : tensor<?x?x4x16xf32>, tensor<?x?x4x16xf32>
// into tensor<?x?x16x16xf32>
//
// This is what we usually have in the TileAndFuse pipeline. IMO, this step
// happens after we distribute the op to subgroups/threads and before we
// distribute them to lanes.
Contributor Author

@Max191 @qedawkins I took a stab at writing down one potential path for the data-tiled multi_mma op. The idea is documented in the comment, and I will try to prototype it. Let me know if you see something weird, and we can discuss it next week.

(cc @lialan @pashu123 @bjacob @MaheshRavishankar )

Contributor

    // Here we use the "row-major" layout for inner dimensions. The
    // shape is exactly innerTile[0]xinnerTile[1] where innerTile can be
    // derived from the encoding info. The chunks of the inner tiles are in
    // column-major fashion, but it does not really matter.

@hanhanW Could you explain this part?

Contributor

The shapes here don't match what I had in my head:

    // E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
    // factors for [M, N, K] are [2, 4, 1], then we have:
    //
    //   iree_gpu.multi_mma %lhs, %rhs, %acc {
    //     kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
    //     lhs_permutation = array<i64: 0, 1>,
    //     rhs_permutation = array<i64: 0, 1>,
    //     __is_data_tiled__
    //   } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
    //     into tensor<?x?x32x64xf32>

My understanding was that the interleaved dimension will always be collapsed into the innermost dimension. So, in this example, with [2, 4, 1] unrolling, the shapes would look like:

    // E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
    // factors for [M, N, K] are [2, 4, 1], then we have:
    //
    //   iree_gpu.multi_mma %lhs, %rhs, %acc {
    //     kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
    //     lhs_permutation = array<i64: 1, 0>,
    //     rhs_permutation = array<i64: 1, 0>,
    //     __is_data_tiled__
    //   } : tensor<?x?x4x32xf32>, tensor<?x?x16x16xf32>
    //     into tensor<?x?x32x64xf32>

Which becomes a sequence of 8 single tiles:

    // E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
    // factors for [M, N, K] are [2, 4, 1], then we have:
    //
    //   iree_gpu.multi_mma %lhs, %rhs, %acc {
    //     kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
    //     lhs_permutation = array<i64: 1, 0>,
    //     rhs_permutation = array<i64: 1, 0>,
    //     __is_data_tiled__
    //   } : tensor<?x?x4x16xf32>, tensor<?x?x16x4xf32>
    //     into tensor<?x?x16x16xf32>

As I think about it more, I think we actually only need __is_data_tiled__ in order to infer the interleaved unrolling factors. The unrolling of M and N can be determined from the output shape, and the interleaved unrolling factor can be determined from the sizes of the inner dimensions on the LHS and RHS.
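As a concrete check against the corrected shapes above: the acc inner tile is 32x64 against the 16x16 intrinsic tile, so the M unrolling is 32/16 = 2 and the N unrolling is 64/16 = 4; the LHS inner tile holds 4*32 = 128 elements, and 128 / (16*4 intrinsic elements) / 2 = 1, so the interleaved K unrolling is 1. The RHS agrees: 16*16 / (16*4) / 4 = 1.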

Contributor Author

I talked to Max offline and we brainstormed what is missing and what needs to be done. I was over-anchoring on compatibility with the CPU path and abusing the opaque types in the multi_mma op, which made the inner tile shapes tricky. We can just add a "resolveEncodingShape" hook (or similar) to the type converter and produce the shapes in those binding ops. Then we don't need additional reshapes to make the flow/hal binding ops happy.

The issue is how we represent the materialized contraction ops. There is no existing op that can express them. The closest one is the multi_mma op, but its definition does not support n-D inner tiles; basically, the *_permutation attributes are broken in this case. The op we're looking for is something similar to multi_mma without those permutation attributes. We think that we can still attach __is_data_tiled__ to the op and say that it has different semantics on inner tiles. The op is very useful because it already implements TilingInterface, which we can use to do workgroup-level distribution. My plan is to use it in the prototype, and we can branch off a new op when it becomes tricky.

(cc @lialan @pashu123 )

Contributor

I am strongly -1 on another attribute on the multi_mma op. I would just add a new kind of attribute, maybe #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>>, that captures what you need. Also, the permutation fields simply describe the permutation of the input operands relative to the "canonical" layout; the meaning of those permutation fields depends on the mma_attr kind on the multi_mma op.

Contributor Author

The permutation is broken if we expand the inner tiles: they are no longer a single M/N/K dimension. The #iree_gpu.data_tiled_layout<*> approach looks better to me, but note that the permutation attribute would still be broken. Or we can note that the permutations are not meaningful when the kind is data_tiled_layout. (My concern is that the op semantics become more complicated, but I agree that it is the easiest way to make it work.)

Contributor

The permutation attribute is not restricted to 2-D; in fact, it already gets expanded to >2-D during the course of normal lowering without data tiling, so expanding the inner tiles during MaterializeEncodings should be fine. AFAICR I didn't write any patterns with the multi_mma op that look at the dimensionality or shape of the inner dims; all of that logic is part of the interface implementations of #iree_gpu.mma_layout. Hence, if we need different behavior for data tiling, then a new kind of mma_attr like iree_gpu.data_tiled_layout<...> is the way to go. Then the permutation fields are only there if they are useful, which perhaps they won't be for data tiling.

Contributor Author

Okay, cool! I'll try to add the attribute and expand the shapes during materialization.

Contributor Author

@hanhanW hanhanW Aug 20, 2024

#18294 is ready for review. I'll rebase the PR and add the new attribute.

@hanhanW hanhanW requested a review from Max191 August 16, 2024 22:34
@hanhanW hanhanW marked this pull request as ready for review August 16, 2024 22:34
@hanhanW hanhanW requested a review from lialan August 16, 2024 22:34
@hanhanW hanhanW marked this pull request as draft August 20, 2024 00:51
@hanhanW hanhanW force-pushed the gpu-contraction-to-multi-mma branch 3 times, most recently from 6429fdc to 15b2603 on August 21, 2024 03:13
@hanhanW hanhanW force-pushed the gpu-contraction-to-multi-mma branch from 15b2603 to 72b75eb on August 21, 2024 03:14
Comment on lines 206 to 221
int64_t expectedNumLhsElem = m * k;
int64_t expectedNumRhsElem = n * k;
int64_t expectedNumAccElem = m * n;

// The below variables have zero values when the op is at thread level.
int64_t M0K0 = lhsInnerElementCount / m / k; // (M0M1K0K1) / M1 / K1 = M0K0
int64_t N0K0 = rhsInnerElementCount / n / k; // (N0N1K0K1) / N1 / K1 = N0K0
int64_t M0N0 = accInnerElementCount / m / n; // (M0M1N0N1) / M1 / N1 = M0N0
if (hasPureTensorSemantics() && M0K0 && N0K0 && M0N0) {
int mUnrollFactor = std::sqrt(M0K0 * M0N0 / N0K0);
int nUnrollFactor = std::sqrt(M0N0 * N0K0 / M0K0);
int kUnrollFactor = std::sqrt(M0K0 * N0K0 / M0N0);
expectedNumLhsElem *= mUnrollFactor * kUnrollFactor;
expectedNumRhsElem *= nUnrollFactor * kUnrollFactor;
expectedNumAccElem *= mUnrollFactor * nUnrollFactor;
}
Contributor Author

@qedawkins I ended up making getMNKShape return the intrinsic size, because the attribute itself does not carry shapes, so it really can't infer the MNK shape. Instead, this is where the unrolling-factor inference happens. I figured out the formula to calculate the unrolling factors and documented it in the comments.
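For reference, using the naming from the snippet (M0, N0, and K0 are the M/N/K unrolling factors, so M0K0 = M0*K0, N0K0 = N0*K0, and M0N0 = M0*N0), the square roots recover the individual factors because

    M0K0 * M0N0 / N0K0 = (M0*K0) * (M0*N0) / (N0*K0) = M0^2

so mUnrollFactor = sqrt(M0K0 * M0N0 / N0K0) = M0, and the N and K cases follow by symmetry.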

The missing interface method implementation is populateOperandOffsetsSizesStrides. For the methods that we don't care about in data tiling, I just put an assertion to signal the failure.

Please take a look. We can discuss some details tomorrow. If you want to look at GPU ops/attributes changes, you can take a look at this commit: 72b75eb#diff-334a4fcfe53c3ad70f495b1a0b7e8d5f5f4e4181e04fd359193cbc3097cea68fR206-R221

(cc @bjacob @pashu123 @qedawkins @lialan )

Contributor

I forgot one aspect when we chatted yesterday, which is that getMNKShape is also used by hasThreadSemantics, which is required by the thread distribution pattern. Rather than allowing these unrolling factors on the inner dims (which have no meaning for any non-data-tiled mma attr), I would just add a FailureOr<bool> innerShapeHasThreadSemantics interface method to MmaInterfaceAttr with a default implementation that checks the MNK shape like the current verifier does (and fails if the shape is invalid). We could use this to get stronger verification of the non-data-tiled path too.

@hanhanW hanhanW marked this pull request as ready for review August 21, 2024 03:19
AttrBuilder<(ins "MMAIntrinsic":$intrinsic)>
AttrBuilder<(ins
"MMAIntrinsic":$intrinsic,
CArg<"bool", "false">:$isDataTiled
Contributor

I am not yet convinced that a single isDataTiled boolean is sufficient. I would much rather this just be a separate Attribute class that implements MmaInterfaceAttr. This should make getOperandOffsetsSizesAndStrides much cleaner to implement. We can always make #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_...>> take one of these attributes as a sub-attribute, allowing us to query all of the same interface methods + extra class declarations from the underlying mma attr. That will make it much easier to evolve data-tiling implementations without mixing them with the default paths.

Contributor

The isDataTiled boolean is functionally the same as adding an iree_gpu.data_tiled_layout wrapper class for this early prototype. Right now, the goal is to quickly lay down the base for GPU materialization for just the F32 intrinsic. Once that is working, there will be time to refactor the codegen details. Since this has been blocking progress for a while, and it is only landing on an experimental branch, I think it is fine for now. I agree this should not land like this on main, but when things are fleshed out and ready to land, it can be refactored.


Contributor

@Max191 Max191 left a comment

This looks good to me now. Ultimately, I agree with @qedawkins that we should add a new class for data tiled mma intrinsic types, but for now I am okay with this so we can continue making progress.


@hanhanW hanhanW force-pushed the gpu-contraction-to-multi-mma branch from ddb024b to ca0f273 on August 21, 2024 22:20
@hanhanW hanhanW requested a review from qedawkins August 21, 2024 22:23
let cppNamespace = "::mlir::iree_compiler::IREE::GPU";

let description = [{
An mma instruction for data tiling :)
Contributor Author

ah, I need to update this. :)

Comment on lines +900 to +901
std::tuple<VectorType, VectorType, VectorType>
DataTiledMMAAttr::getABCVectorTypes() const {
Contributor Author

This is required by the interface, and I just duplicated the code from MMAAttr. We can perhaps just return a failure (i.e., {{}, {}, {}}) for now.
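A minimal sketch of that fallback (not what this revision does; it currently duplicates the MMAAttr code) would be to return default-constructed, i.e. null, VectorTypes so that callers can detect the unimplemented case:

// Sketch only: signal "not implemented yet" with null vector types, matching
// the {{}, {}, {}} idea above.
std::tuple<VectorType, VectorType, VectorType>
DataTiledMMAAttr::getABCVectorTypes() const {
  return {VectorType(), VectorType(), VectorType()};
}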

Contributor

just returning failure SGTM too, up to you for this first version.

Contributor

@qedawkins qedawkins left a comment

Nice! Quick skim LGTM (need to leave for a bit) but looks good for the branch. We'll want to build out the implementation a bit more before landing on main I think.

int64_t expectedNumAccElem = m * n;

if (auto dataTiledMmaAttr =
dyn_cast<IREE::GPU::DataTiledMMAAttr>(getKind())) {
Contributor

It would be nice to move this out of the MultiMmaOp when we go to land it in main but this is good for now.


@hanhanW hanhanW merged commit 6bc3c8b into iree-org:shared/gpu-data-tiling-materialize-encoding Aug 22, 2024
34 of 36 checks passed
@hanhanW hanhanW deleted the gpu-contraction-to-multi-mma branch August 22, 2024 00:22