How to tune small M shape matmul? #9

leiwen83 · 2024-06-04T14:29:09Z

Hi~

Very nice post!
I see the benchmark currently targets square shapes (M=N=K) at various sizes. If M is very small, e.g. M=1, N=1792, K=5120, how can that case be handled well?

I checked, and the sgemm result is 101 GFLOPS, but the tuned kernel (@10) only gets 25.7 GFLOPS...

I think the irregular shape may cause some trouble when moving data around. I'm not sure whether you could provide some insight on how to optimize for it.

Thx~

siboehm · 2024-06-04T14:50:30Z

For shapes like that I assume cuBLAS runs a split-K kernel :) It additionally splits the reduction dimension (K) across thread blocks, which gains you extra parallelism but requires either atomics or a second kernel launch to compute the final result.
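To make the split-K idea concrete, here is a minimal CPU-side sketch in plain Python. It is not a kernel and not what cuBLAS actually does. It only shows the decomposition: each K-slice produces a partial result independently (on a GPU, one group of thread blocks per slice), and a second phase sums the partials (the role of the atomics or the second kernel launch). The helper names, the 128x128 output tile and the split count of 8 are assumptions chosen for illustration.

```python
def ceil_div(a, b):
    return (a + b - 1) // b

def matmul(A, B, M, N, K, k_begin=0, k_end=None):
    """Row-major C = A @ B, restricted to the reduction range [k_begin, k_end)."""
    if k_end is None:
        k_end = K
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(k_begin, k_end):
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C

def splitk_matmul(A, B, M, N, K, splits):
    """Split-K: compute one partial C per K-slice, then reduce the partials."""
    chunk = ceil_div(K, splits)
    # Phase 1: slices are independent, so they could run in parallel.
    partials = [
        matmul(A, B, M, N, K, s * chunk, min(K, (s + 1) * chunk))
        for s in range(splits)
    ]
    # Phase 2: reduction over slices (atomics or a second kernel on a GPU).
    C = [[0.0] * N for _ in range(M)]
    for P in partials:
        for i in range(M):
            for j in range(N):
                C[i][j] += P[i][j]
    return C

def num_blocks(M, N, BM, BN, splits=1):
    """Thread blocks launched with one block per BM x BN output tile per K-slice."""
    return ceil_div(M, BM) * ceil_div(N, BN) * splits

# Assumed 128x128 output tiles for the shape from the question.
print(num_blocks(1, 1792, 128, 128))            # 14
print(num_blocks(1, 1792, 128, 128, splits=8))  # 112
```

With M=1 the output has very few tiles, so a kernel that only parallelizes over M and N launches too few blocks to fill the GPU. Splitting K multiplies the block count by the number of slices, at the cost of the extra reduction step. Note that floating-point results can differ slightly from the unsplit version because the summation order changes.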
