
[LLVMGPU] Add 32x32x16 F8 MFMA intrinsic #19106

Merged: 2 commits into iree-org:main on Nov 13, 2024

Conversation

raikonenfnu (Collaborator)

To enable faster attention in SDXL we need different FP8 MFMA intrinsics. This 32x32x16 FP8 intrinsic (and the virtual intrinsic for the 2nd matmul) has been especially performant when used on this SDXL attention shape (B0: 2, B1: 10, (M, K2): 4096, K1: 64).
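For context (an illustrative aside, not part of the PR text): the listed attention shape divides evenly by the new 32x32x16 intrinsic dimensions. A minimal arithmetic sketch in Python, assuming the attention is lowered as two matmuls, Q @ K^T of shape (M x K1) * (K1 x K2) followed by P @ V of shape (M x K2) * (K2 x K1); the helper below is hypothetical, not code from this PR:

```python
# Illustrative sketch only (not from this PR): check that the SDXL attention
# shape tiles evenly with a 32x32x16 MFMA intrinsic. The two-matmul
# decomposition (Q @ K^T, then P @ V) and this helper are assumptions.
INTRINSIC_M, INTRINSIC_N, INTRINSIC_K = 32, 32, 16

def mfma_tile_count(m, n, k):
    """Number of 32x32x16 MFMA tiles needed to cover an m x n x k matmul."""
    assert m % INTRINSIC_M == 0 and n % INTRINSIC_N == 0 and k % INTRINSIC_K == 0
    return (m // INTRINSIC_M) * (n // INTRINSIC_N) * (k // INTRINSIC_K)

M, K1, K2 = 4096, 64, 4096  # per (B0, B1) batch slice
print(mfma_tile_count(M, K2, K1))  # 1st matmul, Q @ K^T: 128 * 128 * 4 tiles
print(mfma_tile_count(M, K1, K2))  # 2nd matmul, P @ V:   128 * 2 * 256 tiles
```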

@kuhar (Member) left a comment

Should we wait for #19098 from @bjacob to land first?

@raikonenfnu (Collaborator, Author)

> Should we wait for #19098 from @bjacob to land first?

Sounds good to me!

@bjacob self-requested a review on November 12, 2024 at 01:41
@bjacob (Contributor) left a comment

This sounds time-sensitive (motivated by actual usage) so feel free to merge this. I can easily rebase #19098 and #19099 over that. They are respectively 2- and 3-deep into a chain of dependent PRs under review, so I can't guarantee that I'll be able to merge them tomorrow.

If I rebase my stuff around this, I will be tempted to support not just this particular intrinsic, but the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)
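(Illustrative aside, not from the thread: the "whole family" mentioned above is the 2x2 product of FP8 element types for the A and B operands. A tiny sketch follows, using a hypothetical name pattern rather than the actual IREE enum identifiers.)

```python
# Hypothetical enumeration of the suggested intrinsic family. The name
# pattern below is illustrative only, not the actual IREE enum spelling.
from itertools import product

fp8_types = ["F8E4M3", "F8E5M2"]
family = [f"MFMA_F32_32x32x16_{a}_{b}" for a, b in product(fp8_types, repeat=2)]
print(family)  # 2 x 2 = 4 combinations of A/B element types
```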

@raikonenfnu (Collaborator, Author)

> This sounds time-sensitive (motivated by actual usage) so feel free to merge this. I can easily rebase #19098 and #19099 over that. They are respectively 2- and 3-deep into a chain of dependent PRs under review, so I can't guarantee that I'll be able to merge them tomorrow.
>
> If I rebase my stuff around this, I will be tempted to support not just this particular intrinsic, but the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)

@bjacob Thanks for being super considerate! This one is not super time-sensitive, and I am not blocked on it for development. Please land your PRs first; they seem mostly landable anyway!

> ... the whole family (all 2x2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). Not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but definitely a suggestion :-)

Haha sure thing! That'd save us another PR! :)

@bjacob (Contributor) commented on Nov 12, 2024

Both my PRs are merged. You can now rebase this! Note: the numbering scheme for these enum values has changed in #19098. Please renumber accordingly!

@bjacob (Contributor) left a comment

Looking great!

@ScottTodd merged commit b08ea12 into iree-org:main on Nov 13, 2024 (31 of 36 checks passed)
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
4 participants