
[LLVMGPU] Add 32x32x16 F8 MFMA intrinsic #19106

Merged: 2 commits into iree-org:main on Nov 13, 2024

Conversation

raikonenfnu (Collaborator)

To enable faster attention in SDXL we need different FP8 MFMA intrinsics. This 32x32x16 FP8 intrinsic (and the virtual intrinsic for the 2nd matmul) has been especially performant when used on this SDXL attention shape (B0: 2, B1: 10, (M, K2): 4096, K1: 64).
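For context (an illustrative aside, not part of the PR text): the listed attention shape divides evenly by the new 32x32x16 intrinsic dimensions. A minimal arithmetic sketch in Python, assuming the attention is lowered as two matmuls, Q @ K^T of shape (M x K1) * (K1 x K2) followed by P @ V of shape (M x K2) * (K2 x K1); the helper below is hypothetical, not code from this PR:

```python
# Illustrative sketch only (not from this PR): check that the SDXL attention
# shape tiles evenly with a 32x32x16 MFMA intrinsic. The two-matmul
# decomposition (Q @ K^T, then P @ V) and this helper are assumptions.
INTRINSIC_M, INTRINSIC_N, INTRINSIC_K = 32, 32, 16

def mfma_tile_count(m, n, k):
    """Number of 32x32x16 MFMA tiles needed to cover an m x n x k matmul."""
    assert m % INTRINSIC_M == 0 and n % INTRINSIC_N == 0 and k % INTRINSIC_K == 0
    return (m // INTRINSIC_M) * (n // INTRINSIC_N) * (k // INTRINSIC_K)

M, K1, K2 = 4096, 64, 4096  # per (B0, B1) batch slice
print(mfma_tile_count(M, K2, K1))  # 1st matmul, Q @ K^T: 128 * 128 * 4 tiles
print(mfma_tile_count(M, K1, K2))  # 2nd matmul, P @ V:   128 * 2 * 256 tiles
```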

@kuhar (Member) left a comment

Should we wait for #19098 from @bjacob to land first?

@raikonenfnu (Collaborator, Author)

> Should we wait for #19098 from @bjacob to land first?

Sounds good to me!

@bjacob self-requested a review on November 12, 2024 at 01:41
@bjacob (Contributor) left a comment

This sounds time-sensitive (motivated by actual usage) so feel free to merge this. I can easily rebase #19098 and #19099 over that. They are respectively 2- and 3-deep into a chain of dependent PRs under review, so I can't guarantee that I'll be able to merge them tomorrow.

If I rebase my stuff around this, I will be tempted to support not just this particular intrinsic, but the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)
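(Illustrative aside, not from the thread: the "whole family" mentioned above is the 2x2 product of FP8 element types for the A and B operands. A tiny sketch follows, using a hypothetical name pattern rather than the actual IREE enum identifiers.)

```python
# Hypothetical enumeration of the suggested intrinsic family. The name
# pattern below is illustrative only, not the actual IREE enum spelling.
from itertools import product

fp8_types = ["F8E4M3", "F8E5M2"]
family = [f"MFMA_F32_32x32x16_{a}_{b}" for a, b in product(fp8_types, repeat=2)]
print(family)  # 2 x 2 = 4 combinations of A/B element types
```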

@raikonenfnu (Collaborator, Author)

> This sounds time-sensitive (motivated by actual usage) so feel free to merge this. I can easily rebase #19098 and #19099 over that. They are respectively 2- and 3-deep into a chain of dependent PRs under review, so I can't guarantee that I'll be able to merge them tomorrow.
>
> If I rebase my stuff around this, I will be tempted to support not just this particular intrinsic, but the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)

@bjacob Thanks for being super considerate! This one is not super time-sensitive, and I am not blocked on it for development. Please land your PRs first; they seem mostly landable anyway!

> ... the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)

Haha sure thing! That'd save us another PR! :)

@bjacob (Contributor) commented on Nov 12, 2024

Both my PRs are merged. You can now rebase this! Note: the numbering scheme for these enum values has changed in #19098. Please renumber accordingly!

@bjacob (Contributor) left a comment

Looking great!

@ScottTodd merged commit b08ea12 into iree-org:main on Nov 13, 2024 (31 of 36 checks passed)
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
4 participants