Very nice post!
I see the benchmark currently targets square shapes (M=N=K) at various sizes. If M is very small, e.g. M=1, N=1792, K=5120, how can that case be handled well?
When I checked, sgemm reaches 101 GFLOPS, but kernel tune @10 only gets 25.7 GFLOPS...
I think the irregular shape may cause some trouble with moving data around. Could you share some insight on how to optimize for this case?
Thx~
For shapes like that, I assume cuBLAS runs a split-K kernel :) It additionally splits the reduction (K) dimension, which gains you extra parallelism but requires either atomics or a second kernel launch to compute the final result.
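To make the idea concrete, here is a minimal sketch of the atomics variant for the M=1 case. It is not what cuBLAS actually runs, just an illustration under a few assumptions: B is stored row-major as K x N, the output is zeroed before the launch, and the names `sgemv_split_k`, `run_split_k` and the `splits` parameter are made up for this example. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Computes y[n] = sum_k a[k] * B[k * N + n] for the M = 1 case.
// Assumption: B is row-major K x N and y was zeroed before the launch.
// blockIdx.y selects a slice of the K range; every slice adds its partial
// sum into y with atomicAdd, which is the "atomics" flavour of split-K.
__global__ void sgemv_split_k(const float* a, const float* B, float* y,
                              int N, int K, int k_per_split) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    int k_begin = blockIdx.y * k_per_split;
    int k_end = min(k_begin + k_per_split, K);
    float partial = 0.0f;
    for (int k = k_begin; k < k_end; ++k) {
        // Neighbouring threads read neighbouring n, so loads from B coalesce.
        partial += a[k] * B[k * N + n];
    }
    atomicAdd(&y[n], partial);
}

// Hypothetical host wrapper: copies the inputs over, launches the kernel
// with `splits` slices along K, and returns the result vector of length N.
std::vector<float> run_split_k(const std::vector<float>& a,
                               const std::vector<float>& B,
                               int N, int K, int splits) {
    float *d_a, *d_B, *d_y;
    cudaMalloc(&d_a, K * sizeof(float));
    cudaMalloc(&d_B, (size_t)K * N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_a, a.data(), K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B.data(), (size_t)K * N * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_y, 0, N * sizeof(float));  // atomics accumulate into zeros

    int threads = 256;
    int k_per_split = (K + splits - 1) / splits;
    dim3 grid((N + threads - 1) / threads, splits);
    sgemv_split_k<<<grid, threads>>>(d_a, d_B, d_y, N, K, k_per_split);

    std::vector<float> y(N);
    cudaMemcpy(y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_B);
    cudaFree(d_y);
    return y;
}
```

With N=1792 and 256 threads per block, a kernel without the split would launch only 7 blocks, which leaves most of the GPU idle; with e.g. `splits = 16` it launches 112. The trade-off is that the order of the atomic additions is not fixed, so results can differ in the last bits between runs. The second-kernel variant would instead write each slice's partial sums to a scratch buffer of size `splits * N` and reduce them afterwards.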