[feature request] Whisper with openblas #52
Hi, sorry for the late reply.
XNNPACK also provides a set of quantized operators (including 8-bit ones). It may seem counterintuitive, but writing "fast" 8-bit operators is actually more complex than writing fast float operators.
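A minimal sketch of why (names are illustrative, not XNNPACK's or OnnxStream's API): an int8 matmul has to accumulate in int32 and then requantize back to int8, which adds per-tensor scales, rounding, and saturation logic that the float path simply doesn't have — and all of it has to be fused into the SIMD inner loop without killing throughput.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical W8A8 dot product: weights and activations are int8,
// but the accumulator must be int32 to avoid overflow.
int32_t dot_i8(const int8_t* a, const int8_t* w, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(w[i]);
    return acc;
}

// Requantize the int32 accumulator back into int8 output space.
// This scale-multiply / round / clamp step is exactly the extra
// machinery that float kernels don't need.
int8_t requantize(int32_t acc, float a_scale, float w_scale, float out_scale) {
    float real = acc * a_scale * w_scale;  // back to the real-valued domain
    int32_t q = static_cast<int32_t>(std::lround(real / out_scale));
    return static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));  // saturate
}
```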
Regarding OpenBLAS, I don't know: I was actually thinking of hipBLAS (for AMD GPUs), since the cost in terms of code changes should be almost zero. However, I will take a look at OpenBLAS to understand how complex the integration would be.
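For context on the "almost zero" cost: hipBLAS deliberately mirrors the cuBLAS API, so a GEMM call ports almost mechanically. A minimal sketch, assuming a plain SGEMM call site (the wrapper below is illustrative, not from OnnxStream; the header path varies with the ROCm version — older releases use `<hipblas.h>`):

```cpp
// cuBLAS (NVIDIA) call for comparison:
// cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
//             m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);

#include <hipblas/hipblas.h>

// hipBLAS (AMD) -- identical argument list, only the names change.
hipblasStatus_t sgemm(hipblasHandle_t handle,
                      int m, int n, int k,
                      const float* A, int lda,
                      const float* B, int ldb,
                      float* C, int ldc) {
    const float alpha = 1.0f, beta = 0.0f;
    return hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                        m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
}
```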
I'm no Whisper expert, but with projects like "insanely-fast-whisper" around, would it make sense? Then again, the idea of "Whisper on the Raspberry Pi Zero 2" could be more interesting, perhaps :-)
Thanks, Vito
Okay, got the point regarding XNNPACK.

I guess you should implement hipBLAS first, since you will only have to make minimal changes. Just please allow overriding the GPU via AMDGPU_TARGETS, like llama.cpp does (see the build sketch below). I actually have a machine with an AMD GPU that is not officially supported by hipBLAS, and overriding lets me run llama.cpp (with better performance than the CPU), but I can't run anything that doesn't support overriding the GPU, for example the OnnxStream LLM demo with GPU. That is why I was asking for OpenBLAS support.

insanely-fast-whisper is aimed at servers with GPUs; you could aim for the CPU. Also, every Whisper quantization implementation for CPU inference that I have seen so far does only weight quantization. You could do both weight and activation quantization, thereby reducing both disk and memory usage.
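For reference, a rough sketch of how the override works with llama.cpp; flag and binary names have changed across llama.cpp versions, and `gfx1030` / `10.3.0` are just example values for an RDNA2 target:

```sh
# Build llama.cpp for a specific AMD GPU architecture (example target):
cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build

# At runtime, the ROCm runtime can be told to treat an officially
# unsupported GPU as a supported one:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/main -m model.gguf -p "hello"
```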
ok, got it.
I will definitely look at how AMDGPU_TARGETS works in llama.cpp.
Regarding activation quantization, I suspect that it can't be done, but obviously I have to investigate. If the numerical ranges of the activations are too large, W8A8 quantization may produce results that are too imprecise or completely wrong (see the toy example below). This might be why no one has done it already :-)
Vito
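A toy example of the range problem (illustrative only, not OnnxStream code): with symmetric per-tensor int8 quantization, a single activation outlier inflates the scale so much that all the small values collapse to zero on the round trip.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric per-tensor int8 quantization: the scale is set by the
// largest absolute value in the tensor.
int8_t quantize(float x, float scale) {
    int32_t q = static_cast<int32_t>(std::lround(x / scale));
    return static_cast<int8_t>(std::clamp<int32_t>(q, -127, 127));
}

int main() {
    // Typical small activations plus one large outlier, as often
    // happens in transformer activations.
    float acts[] = {0.02f, -0.05f, 0.01f, 0.03f, 12.0f};
    float max_abs = 0.0f;
    for (float a : acts) max_abs = std::max(max_abs, std::fabs(a));
    float scale = max_abs / 127.0f;  // ~0.094: one step dwarfs the small values

    for (float a : acts) {
        float back = quantize(a, scale) * scale;   // quantize, then dequantize
        std::printf("%8.4f -> %8.4f\n", a, back);  // small values round to 0
    }
}
```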
First of all, thanks for this cool project. Now that you have started adding support for models other than Stable Diffusion, please also add support for Whisper with W8A8 quantization.
Also, it seems XNNPACK is for speeding up float operations. Does that mean XNNPACK is not required for W8A8 inference?
Also, consider adding OpenBLAS as a drop-in replacement for cuBLAS, so that accelerated inference is also possible on Intel and AMD CPUs with integrated graphics, where CUDA is not available.
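For what it's worth, the CPU-side mapping this request implies is similarly mechanical: OpenBLAS exposes the standard CBLAS interface, so the same GEMM shape translates to a host call. A sketch only (the wrapper name is hypothetical; note that CBLAS takes an explicit storage order and passes alpha/beta by value, and there is no device memory to manage):

```cpp
#include <cblas.h>  // provided by OpenBLAS

// The same SGEMM as the cuBLAS/hipBLAS call earlier, but on the CPU.
void sgemm_cpu(int m, int n, int k,
               const float* A, int lda,
               const float* B, int ldb,
               float* C, int ldc) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, lda, B, ldb, 0.0f, C, ldc);
}
```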