[LLVMGPU] Add 32x32x16 F8 MFMA intrinsic #19106
Conversation
This sounds time-sensitive (motivated by actual usage) so feel free to merge this. I can easily rebase #19098 and #19099 over that. They are respectively 2- and 3-deep into a chain of dependent PRs under review, so I can't guarantee that I'll be able to merge them tomorrow.
If I rebase my stuff around this, I will be tempted to support not just this particular intrinsic but the whole family (all 2×2 = 4 combinations of f8e4m3 / f8e5m2 data types for A and B). I'm not necessarily asking for this generalization to be made in this PR, if it's time-sensitive, but it's definitely a suggestion :-)
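For illustration, here is a minimal sketch of the 2×2 family being suggested. The enum and intrinsic names below are hypothetical placeholders, not IREE's actual TableGen or C++ definitions; the point is only that choosing f8e4m3 or f8e5m2 independently for the A and B operands yields four 32x32x16 variants.

```cpp
// Hypothetical sketch (not IREE's actual definitions): enumerate the
// 2x2 = 4 A/B element-type combinations for a 32x32x16 F8 MFMA family.
#include <array>
#include <cstdio>

enum class F8Type { E4M3, E5M2 };

static const char *name(F8Type t) {
  return t == F8Type::E4M3 ? "f8e4m3" : "f8e5m2";
}

int main() {
  constexpr std::array<F8Type, 2> kTypes = {F8Type::E4M3, F8Type::E5M2};
  for (F8Type a : kTypes)
    for (F8Type b : kTypes)
      // e.g. a mixed-type variant would pair f8e4m3 A with f8e5m2 B.
      std::printf("MFMA_F32_32x32x16_%s_%s\n", name(a), name(b));
  return 0;
}
```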
@bjacob Thanks for being super considerate! This one is not super time-sensitive, and I am not blocked on it for development. Please land your PRs first; they seem mostly landable anyway!
Haha sure thing! That'd save us another PR! :)
Both my PRs are merged. You can now rebase this! Note: the numbering scheme for these enum values has changed in #19098. Please renumber accordingly!
To enable faster attention on SDXL, we need different FP8 MFMA intrinsics. This 32x32x16 FP8 intrinsic (and the virtual intrinsic for the 2nd matmul) has been especially performant when used on this SDXL attention shape (B0: 2, B1: 10, (M, K2): 4096, K1: 64). Signed-off-by: Stanley Winata <[email protected]>
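As a back-of-the-envelope illustration of the shape in the description, the sketch below counts how many 32x32x16 MFMA issues cover the first attention matmul Q·K^T, assuming a plain tiling with no fusion or scheduling details (this is not IREE's actual codegen, just arithmetic: one 32x32x16 MFMA produces a 32x32 output tile while reducing over 16 elements of the contraction dimension).

```cpp
// Assumed tiling sketch, not IREE's actual schedule: count 32x32x16 MFMA
// issues for the first SDXL attention matmul (M x K1) * (K1 x K2).
#include <cstdint>
#include <cstdio>

int main() {
  // Intrinsic tile: 32x32 output, K = 16 reduction per issue.
  constexpr int64_t kIntrM = 32, kIntrN = 32, kIntrK = 16;

  // SDXL attention shape from the PR description.
  constexpr int64_t B0 = 2, B1 = 10, M = 4096, K2 = 4096, K1 = 64;

  // Q*K^T per batch: (M x K1) * (K1 x K2) -> (M x K2).
  int64_t tilesPerBatch = (M / kIntrM) * (K2 / kIntrN) * (K1 / kIntrK);
  int64_t totalTiles = B0 * B1 * tilesPerBatch;

  std::printf("MFMA issues per batch: %lld\n", (long long)tilesPerBatch);
  std::printf("MFMA issues total:     %lld\n", (long long)totalTiles);
  return 0;
}
```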
Force-pushed from c006dd4 to 854e675.
Looking great!