[GPU] Add support for materializing contraction ops. #18196
Conversation
compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp
Overall, looks good, but we need to account for permutations of the inner tile in the multi_mma op. Take a look at my comment.
compiler/src/iree/compiler/Codegen/Common/GPU/GPUMaterializeEncoding.cpp
auto mmaOp = rewriter.create<IREE::GPU::MultiMmaOp>(
    loc, operands[0], operands[1], operands[2],
    ArrayRef<AffineMap>{lhsMap, rhsMap, accMap},
    iteratorTypes, kind.value());
The materialization patterns expect a column-major RHS inner tile layout, right? If that is the case, then I think the MultiMmaOp needs to have an rhs_permutation. We also need to account for permutations introduced by data tiling for each operand. The semantics of the op are that with no permutations, each input is assumed to have a row-major inner tile layout.
You can see in this test:
iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/test/iree_gpu_ops.mlir, lines 190 to 209 in 3901e62:
// -----
#contraction_accesses = [
  affine_map<(i, j, k) -> (i, k)>,
  affine_map<(i, j, k) -> (k, j)>,
  affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
    %lhs: tensor<?x?x32x8xf16>,
    %rhs: tensor<?x?x32x8xf16>,
    %acc: tensor<?x?x32x32xf32>) -> tensor<?x?x32x32xf32> {
  %0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
    indexing_maps = #contraction_accesses,
    iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
    kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
    rhs_permutation = array<i64: 1, 0>
  } : tensor<?x?x32x8xf16>, tensor<?x?x32x8xf16> into tensor<?x?x32x32xf32>
  return %0 : tensor<?x?x32x32xf32>
}
The op has rhs_permutation = array<i64: 1, 0> when the rhs inner tile (32x8) is column-major (MxK). In the case of data tiling, the inner tile will also be expanded and permuted with some general tile swizzle. It may look something like this after the expand_shape:
#contraction_accesses = [
affine_map<(i, j, k) -> (i, k)>,
affine_map<(i, j, k) -> (k, j)>,
affine_map<(i, j, k) -> (i, j)>
]
func.func @tensor_subgroup_matmul_transpose_b_32x32x8_multi_mma(
%lhs: tensor<?x?x32x2x4xf16>,
%rhs: tensor<?x?x32x2x4xf16>,
%acc: tensor<?x?x4x2x4x32xf32>) -> tensor<?x?x4x2x4x32xf32> {
%0 = iree_gpu.multi_mma %lhs, %rhs, %acc {
indexing_maps = #contraction_accesses,
iterator_types = [#iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<parallel>, #iree_gpu.iterator_type<reduction>],
kind = #iree_gpu.mma_layout<MFMA_F32_32x32x8_F16>,
rhs_permutation = array<i64: 2, 0, 1>
} : tensor<?x?x32x2x4xf16>, tensor<?x?x32x2x4xf16> into tensor<?x?x4x2x4x32xf32>
return %0 : tensor<?x?x4x2x4x32xf32>
}
Now, rhs_permutation = array<i64: 2, 0, 1>, because the canonical layout after the expand_shape would be 2x4x32, but it is 32x2x4. If there is an additional transpose due to data tiling, then that transpose would also need to be composed with rhs_permutation here.
The materialization pattern should consider the transposes that happened on the lhs, rhs, and acc, and express them accordingly in the MultiMmaOp.
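As a minimal standalone sketch of what composing those transposes with rhs_permutation could look like (plain C++ with hypothetical helpers, not the actual IREE/MLIR utilities, using the result[i] = src[perm[i]] convention from the example above):

// Hypothetical standalone example; perm semantics: result[i] = src[perm[i]],
// matching rhs_permutation = array<i64: 2, 0, 1> turning 2x4x32 into 32x2x4.
#include <cassert>
#include <cstdint>
#include <vector>

static std::vector<int64_t> applyPerm(const std::vector<int64_t> &src,
                                      const std::vector<int64_t> &perm) {
  std::vector<int64_t> result;
  for (int64_t p : perm)
    result.push_back(src[p]);
  return result;
}

// Composes `first` followed by `second` into a single permutation.
static std::vector<int64_t> composePerms(const std::vector<int64_t> &first,
                                         const std::vector<int64_t> &second) {
  std::vector<int64_t> composed;
  for (int64_t s : second)
    composed.push_back(first[s]);
  return composed;
}

int main() {
  std::vector<int64_t> canonical = {2, 4, 32}; // canonical inner tile
  std::vector<int64_t> rhsPerm = {2, 0, 1};    // from the example above
  assert((applyPerm(canonical, rhsPerm) == std::vector<int64_t>{32, 2, 4}));
  // A further (hypothetical) transpose introduced by data tiling is folded in
  // by composing it with the existing permutation.
  std::vector<int64_t> extraTranspose = {0, 2, 1};
  std::vector<int64_t> folded = composePerms(rhsPerm, extraTranspose);
  assert((applyPerm(canonical, folded) ==
          applyPerm(applyPerm(canonical, rhsPerm), extraTranspose)));
  return 0;
}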
Side note: This is sort of contradictory to the assumption we had been using in our discussions that the inner tile dimensions are effectively opaque, and I believe it is part of the reason why we may need to extend the semantics (or mma enums) of the multi_mma op in order to do interleaving. Right now, the inner tile dims (of a single intrinsic tile) are separated from the outer tiles, but in data tiling we want to interleave them.
@qedawkins can you confirm that I am getting all the semantics correct?
Interesting, and thanks for pointing it out. I think you're talking about the C matrix, not the B matrix, in the example? I agree that we need to attach those permutations because we need a way to express different inner layouts. So yes, here we want to add {1, 0} to LHS, RHS, and ACC.
I can't really interpret the case of rhs_permutation = array<i64: 2, 0, 1>, because the permutations are defined for the MNK mapping, whereas here the RHS rank is 3. I feel that something is missing, and we might need to extend the semantics in this case.
iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.td, lines 115 to 118 in 868f41e:
This operation can represent an intrinsic both in subgroup/warp and
distributed (thread) abstractions through the intrinsic attribute interface.
It does so semi-opaquely by including optional permutations of each MMA
fragment with respect to the "canonical" MNK row major matrix multiply.
After thinking a lot, I think I made a mistake in the shapes. It should be *x4x16 for the LHS after materialization, not *x16x4, because we already relayout the inner tiles in a column-major fashion. I'll rebase on #18135 and reuse the helpers.
// TODO(hanchung): Support batch gemms.
Location loc = op.getLoc();
SmallVector<int64_t> identityPerm = {0, 1};
nit: remove unused variable.
Force-pushed 2b1310b to 2ee5060
Force-pushed 0091ebd to ef97f30
Rebased. I will address Max's review comments and fix what I identified in the comment soon. Converting to draft for now.
Force-pushed 2db1402 to effd193
// The current multi_mma op is not able to represent the data-tiled mma
// layouts. There are internal details about the layout after we data-tile a
// matrix. The matrix is in row-major fashion for the outer dimensions. The
// inner dimensions were in row-major order, and then we swizzle them to be
// column-major. The number of inner dimensions is greater than 2 after tile
// swizzling. Here we use the "row-major" layout for the inner dimensions. The
// shape is exactly innerTile[0]xinnerTile[1], where innerTile can be
// derived from the encoding info. The chunks of the inner tiles are in
// column-major fashion, but it does not really matter, because they are
// opaque. I don't find a way to represent it, so I added a
// `__is_data_tiled__` magic attribute to the op. The expectation is that
// there is a pass to resolve the `__is_data_tiled__` attribute before
// additional lowering. It should unroll the op and resolve the layout; each
// multi_mma op maps to a single intrinsic at subgroup level. Then we do the
// lowering that we've been doing in the TileAndFuse pipeline.
//
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1]; then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
//   kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
//   lhs_permutation = array<i64: 0, 1>,
//   rhs_permutation = array<i64: 0, 1>,
//   __is_data_tiled__
// } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
//   into tensor<?x?x32x64xf32>
//
// The data-tiled chunk is represented in row-major fashion, so the
// permutation is identity; the `__is_data_tiled__` attribute is attached.
//
// Then we resolve the `__is_data_tiled__` layout, which does the unrolling.
// It becomes a sequence of
//
// iree_gpu.multi_mma %lhs_i, %rhs_i, %acc_i {
//   kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
//   lhs_permutation = array<i64: 1, 0>,
//   rhs_permutation = array<i64: 1, 0>,
// } : tensor<?x?x4x16xf32>, tensor<?x?x4x16xf32>
//   into tensor<?x?x16x16xf32>
//
// This is what we usually have in the TileAndFuse pipeline. IMO, this step
// happens after we distribute the op to subgroups/threads, and before we
// distribute them to lanes.
@Max191 @qedawkins I took a stab at writing down one of the potential paths for the data-tiled multi_mma op. The idea is documented in the comment, and I will try to prototype it. Let me know if you see something weird and we can discuss it next week.
// Here we use the "row-major" layout for inner dimensions. The
// shape is exactly innerTile[0]xinnerTile[1] where innerTile can be
// derived from the encoding info. The chunk of the inner tiles are in
// column-major fashion, but it does not really matter.
@hanhanW Could you explain this part?
The shapes here don't match what I had in my head:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 0, 1>,
// rhs_permutation = array<i64: 0, 1>,
// __is_data_tiled__
// } : tensor<?x?x32x4xf32>, tensor<?x?x16x4xf32>
// into tensor<?x?x32x64xf32>
My understanding was that the interleaved dimension will always be collapsed into the innermost dimension. So, in this example, with [2, 4, 1] unrolling, the shapes would look like:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 1, 0>,
// rhs_permutation = array<i64: 1, 0>,
// __is_data_tiled__
// } : tensor<?x?x4x32xf32>, tensor<?x?x16x16xf32>
// into tensor<?x?x32x64xf32>
Which becomes a sequence of 8 single tiles:
// E.g., say that the intrinsic is MFMA_f32_16x16x4_f32, and the unrolling
// factors for [M, N, K] are [2, 4, 1], then we have:
//
// iree_gpu.multi_mma %lhs, %rhs, %acc {
// kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
// lhs_permutation = array<i64: 1, 0>,
// rhs_permutation = array<i64: 1, 0>,
// __is_data_tiled__
// } : tensor<?x?x4x16xf32>, tensor<?x?x16x4xf32>
// into tensor<?x?x16x16xf32>
As I think about it more, I think we actually only need the __is_data_tiled__ to be able to infer the interleaved unrolling factors. Unrolling of M and N can be determined from the output shape, and the interleaved unrolling factor can be determined by the size of the inner dimensions on the LHS and RHS.
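As a quick sanity check of that inference on the example shapes above (a standalone sketch with made-up variable names, not IREE APIs):

#include <cassert>
#include <cstdint>

int main() {
  // Intrinsic MFMA_F32_16x16x4_F32.
  const int64_t m = 16, n = 16, k = 4;
  // Inner dims from the example: acc is 32x64, lhs inner tile holds 4x32 elements.
  const int64_t accInnerM = 32, accInnerN = 64;
  const int64_t lhsInnerElementCount = 4 * 32;
  // M and N unrolling come straight from the output inner shape.
  const int64_t mUnroll = accInnerM / m; // 2
  const int64_t nUnroll = accInnerN / n; // 4
  // The interleaved K unrolling is what remains of the LHS inner element count.
  const int64_t kUnroll = lhsInnerElementCount / (m * k) / mUnroll; // 1
  assert(mUnroll == 2 && nUnroll == 4 && kUnroll == 1);
  return 0;
}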
I talked to Max offline and we brainstormed what is missing and what needs to be done. I was over-anchoring on compatibility with CPU and abusing the opaque types in the multi_mma op, which makes the inner tile shapes tricky. We can just add a "resolveEncodingShape" method to the type converter or whatever, and produce the shape in those binding ops. Then we don't need additional reshapes to make the flow/hal binding ops happy.
The issue is how to represent the materialized contraction ops. There is no existing op that can express them. The closest one is the multi_mma op, but its definition does not support n-D inner tiles; basically, the *_permutation attributes are broken in this case. The op we're looking for is something similar to multi_mma without those permutation attributes. We think that we can still attach __is_data_tiled__ to the op and say that it has different semantics on inner tiles. The op is very useful because it already implements TilingInterface, and we can use that to do workgroup-level distribution. My plan is to use it in the prototype, and we can branch off a new op when it becomes tricky.
I am strongly -1 to another attribute on the multi_mma op. I would just add a new kind of attribute, maybe #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>>, that captures what you need. Also, the permutation fields simply describe the permutation of the input operands relative to the "canonical" layout. The meaning of those permutation fields depends on the mma_attr kind on the multi_mma op.
The permutations are broken if we expand the inner tiles; they are no longer a single M/N/K dimension. The #iree_gpu.data_tiled_layout<*> looks better to me, but note that the permutation attribute would be broken. Or we can note that the permutations are not meaningful when the kind is data_tiled_layout. (My concern is that the op semantics become more complicated, but I agree that it is the easiest way to make it work.)
The permutation attribute is not restricted to 2-D; in fact, it gets expanded to >2D during the course of normal lowering without data tiling already, so expanding the inner tiles during MaterializeEncodings should be fine. AFAICR I didn't write any patterns w/ the multi-mma op that look at the dimensionality or shape of the inner dims; all of that logic is part of the interface implementations of #iree_gpu.mma_layout. Hence, if we need different behavior for data tiling, then a new kind of mma_attr like iree_gpu.data_tiled_layout<...> is the way to go. Then the permutation fields are only there if they are useful, which perhaps they won't be for data tiling.
okay, cool! I'll try to add the attribute and expand the shapes during materialization.
#18294 is ready for review. I'll rebase the PR and add the new attribute.
Force-pushed 6429fdc to 15b2603
Force-pushed 15b2603 to 72b75eb
int64_t expectedNumLhsElem = m * k;
int64_t expectedNumRhsElem = n * k;
int64_t expectedNumAccElem = m * n;

// The below variables have zero values when the op is at thread level.
int64_t M0K0 = lhsInnerElementCount / m / k; // (M0M1K0K1) / M1 / K1 = M0K0
int64_t N0K0 = rhsInnerElementCount / n / k; // (N0N1K0K1) / N1 / K1 = N0K0
int64_t M0N0 = accInnerElementCount / m / n; // (M0M1N0N1) / M1 / N1 = M0N0
if (hasPureTensorSemantics() && M0K0 && N0K0 && M0N0) {
  int mUnrollFactor = std::sqrt(M0K0 * M0N0 / N0K0);
  int nUnrollFactor = std::sqrt(M0N0 * N0K0 / M0K0);
  int kUnrollFactor = std::sqrt(M0K0 * N0K0 / M0N0);
  expectedNumLhsElem *= mUnrollFactor * kUnrollFactor;
  expectedNumRhsElem *= nUnrollFactor * kUnrollFactor;
  expectedNumAccElem *= mUnrollFactor * nUnrollFactor;
}
@qedawkins I ended up making getMNKShape return the intrinsic size, because the attribute itself does not carry shapes, so it really can't infer the MNK shape. Instead, this is where the unrolling factor inference happens. I figured out the formula for calculating the unrolling factors and documented it in the comments.
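For reference, the formulas check out algebraically, writing M0, N0, K0 for the unroll factors and M1, N1, K1 for the intrinsic sizes (notation not from the code, just for this check):

\[
M_0K_0 = \frac{\text{lhsInnerElementCount}}{M_1K_1},\qquad
N_0K_0 = \frac{\text{rhsInnerElementCount}}{N_1K_1},\qquad
M_0N_0 = \frac{\text{accInnerElementCount}}{M_1N_1}
\]
\[
\sqrt{\frac{(M_0K_0)(M_0N_0)}{N_0K_0}} = \sqrt{M_0^2} = M_0,\qquad
\sqrt{\frac{(M_0N_0)(N_0K_0)}{M_0K_0}} = N_0,\qquad
\sqrt{\frac{(M_0K_0)(N_0K_0)}{M_0N_0}} = K_0
\]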
The missing interface method implementation is populateOperandOffsetsSizesStrides. For the methods that we don't care about in data tiling, I just put an assertion to signal the failure.
Please take a look. We can discuss some details tomorrow. If you want to look at GPU ops/attributes changes, you can take a look at this commit: 72b75eb#diff-334a4fcfe53c3ad70f495b1a0b7e8d5f5f4e4181e04fd359193cbc3097cea68fR206-R221
(cc @bjacob @pashu123 @qedawkins @lialan )
I forgot one aspect when we chatted yesterday, and that is that getMNKShape is also used by hasThreadSemantics, which is required by the thread distribution pattern. Rather than allowing these unrolling factors on the inner dims (which have no meaning for any non-data-tiled mma attr), I would just add a FailureOr<bool> innerShapeHasThreadSemantics interface method to MmaInterfaceAttr with a default implementation that checks the MNK shape like the current verifier does (and fails if the shape is invalid). We could use this to get stronger verification of the non-data-tiled path too.
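A rough standalone sketch of what such a default check might boil down to (hypothetical names and signature, not the actual MmaInterfaceAttr API; the per-thread counts stand in for the element counts of the per-lane vector types):

#include <cstdint>
#include <optional>

struct IntrinsicSizes {
  int64_t m, n, k;                                  // single-intrinsic MNK
  int64_t lhsPerThread, rhsPerThread, accPerThread; // per-lane element counts
};

// Returns true for thread semantics, false for subgroup semantics, and
// std::nullopt when the inner element counts match neither (invalid shape).
static std::optional<bool>
innerShapeHasThreadSemantics(const IntrinsicSizes &sizes, int64_t lhsElems,
                             int64_t rhsElems, int64_t accElems) {
  if (lhsElems == sizes.m * sizes.k && rhsElems == sizes.n * sizes.k &&
      accElems == sizes.m * sizes.n)
    return false;
  if (lhsElems == sizes.lhsPerThread && rhsElems == sizes.rhsPerThread &&
      accElems == sizes.accPerThread)
    return true;
  return std::nullopt;
}

int main() {
  IntrinsicSizes sizes{16, 16, 4, /*lhsPerThread=*/1, /*rhsPerThread=*/1,
                       /*accPerThread=*/4}; // illustrative values only
  // 16x4, 16x4, and 16x16 inner element counts -> subgroup semantics.
  return innerShapeHasThreadSemantics(sizes, 16 * 4, 16 * 4, 16 * 16) ==
                 std::optional<bool>(false)
             ? 0
             : 1;
}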
AttrBuilder<(ins "MMAIntrinsic":$intrinsic)>
AttrBuilder<(ins
  "MMAIntrinsic":$intrinsic,
  CArg<"bool", "false">:$isDataTiled
I am not yet convinced that isDataTiled is sufficient with a single boolean. I would much rather this just be a separate Attribute class that implements the MmaInterfaceAttr. This should make getOperandOffsetsSizesAndStrides much cleaner to implement. We can always make #iree_gpu.data_tiled_layout<#iree_gpu.mma_layout<MFMA_...>> take one of these attributes as a sub-attribute, allowing us to query all of the same interface methods + extra class declarations from the underlying mma attr. That will make it much easier to evolve data tiling implementations without mixing them with the default paths.
The isDataTiled boolean is functionally the same as adding an iree_gpu.data_tiled_layout wrapper class for this early prototype. Right now, the goal is to quickly lay down the base for GPU materialization for just the F32 intrinsic. Once that is working, there will be time to refactor the codegen details. Since this has been blocking progress for a while, and it is only landing to an experimental branch, I think it is fine to do for now. I agree this should not land like this to main, but when things are fleshed out and ready to land, it can be refactored.
This looks good to me now. Ultimately, I agree with @qedawkins that we should add a new class for data tiled mma intrinsic types, but for now I am okay with this so we can continue making progress.
This reverts commit 72b75eb.
Force-pushed ddb024b to ca0f273
let cppNamespace = "::mlir::iree_compiler::IREE::GPU";

let description = [{
  An mma instruction for data tiling :)
ah, I need to update this. :)
std::tuple<VectorType, VectorType, VectorType>
DataTiledMMAAttr::getABCVectorTypes() const {
This is required by the interface, and I just duplicated the code from MMAAttr. We can perhaps just return a failure (i.e., {{}, {}, {}}) for now.
just returning failure SGTM too, up to you for this first version.
Nice! Quick skim LGTM (need to leave for a bit) but looks good for the branch. We'll want to build out the implementation a bit more before landing on main I think.
int64_t expectedNumAccElem = m * n;

if (auto dataTiledMmaAttr =
        dyn_cast<IREE::GPU::DataTiledMMAAttr>(getKind())) {
It would be nice to move this out of the MultiMmaOp when we go to land it in main but this is good for now.
Merged 6bc3c8b into iree-org:shared/gpu-data-tiling-materialize-encoding
To serve the needs of data-tiled contraction ops, we can't simply use the regular mma intrinsics because they implement more interface methods. The multi_mma op with a data-tiled mma intrinsic should be much simpler because we already relayout (i.e., pack + tile swizzle) the matrix. Thus, the revision introduces IREEGPU_DataTiledMMAAttr, which gives hints to the multi_mma op.

The new attribute implements minimal interface methods. Only getABCElementTypes, getMNKShape, and getSubgroupSize are implemented. They are all trivial methods which provide intrinsic details like the amd_matrix_instruction_calculator does. The implementation detail of getABCVectorTypes is not clear at this moment. We can potentially implement it based on the needs of distributing the data-tiled multi_mma op to lanes. The other missing implementation is populateOperandOffsetsSizesStrides. Both methods will be used in the later lowering, and we need to implement them to connect the e2e pieces.

Different from the regular MMA attribute, this attribute has a data-tiling concept which can model unrolling and interleaving layouts. In the concept of data tiling, we always unroll the parallel dimensions (i.e., the M and N dimensions) to be outermost, and interleave the unrolled K dimension; i.e., the unrolled K dimension becomes the innermost dimension. The constraint can be relaxed based on data-tiling needs, and the additional information can be added to parameters.

With the new DataTiledMMAAttr, the revision teaches the multi_mma op to take the unrolling factors into account in the verifier.

The revision sorts out the ambiguities of materialized ops in data tiling and switches MMAAttr to DataTiledMMAAttr in the materialization patterns.

Note: this is not merging into the main branch. This is mainly a prototype, and we will contribute it to main later on.