[feature request] Whisper with openblas #52
Hi, sorry for the late reply.
XNNPACK also provides a set of quantized operators (including 8-bit ones). It may seem counterintuitive, but writing "fast" 8-bit operators is actually more complex than writing fast float operators.
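A minimal sketch of why (names are illustrative, not XNNPACK's or OnnxStream's API): an int8 matmul has to accumulate in int32 and then requantize back to int8, which adds per-tensor scales, rounding, and saturation logic that the float path simply doesn't have — and all of it has to be fused into the SIMD inner loop without killing throughput.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical W8A8 dot product: weights and activations are int8,
// but the accumulator must be int32 to avoid overflow.
int32_t dot_i8(const int8_t* a, const int8_t* w, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(w[i]);
    return acc;
}

// Requantize the int32 accumulator back into int8 output space.
// This scale-multiply / round / clamp step is exactly the extra
// machinery that float kernels don't need.
int8_t requantize(int32_t acc, float a_scale, float w_scale, float out_scale) {
    float real = acc * a_scale * w_scale;  // back to the real-valued domain
    int32_t q = static_cast<int32_t>(std::lround(real / out_scale));
    return static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));  // saturate
}
```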
Regarding OpenBLAS, I don't know: I was actually thinking of hipBLAS (for AMD GPUs), since the cost in terms of code changes should be almost zero. However, I will take a look at OpenBLAS to understand how complex the integration would be.
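For context on the "almost zero" cost: hipBLAS deliberately mirrors the cuBLAS API, so a GEMM call ports almost mechanically. A minimal sketch, assuming a plain SGEMM call site (the wrapper below is illustrative, not from OnnxStream; the header path varies with the ROCm version — older releases use `<hipblas.h>`):

```cpp
// cuBLAS (NVIDIA) call for comparison:
// cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
//             m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);

#include <hipblas/hipblas.h>

// hipBLAS (AMD) -- identical argument list, only the names change.
hipblasStatus_t sgemm(hipblasHandle_t handle,
                      int m, int n, int k,
                      const float* A, int lda,
                      const float* B, int ldb,
                      float* C, int ldc) {
    const float alpha = 1.0f, beta = 0.0f;
    return hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                        m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
}
```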
I'm no Whisper expert, but with projects like "insanely-fast-whisper" around, would it make sense? Then again, the idea of "Whisper on the Raspberry Pi Zero 2" could be more interesting, perhaps :-)
Thanks, Vito
Okay, got the point regarding XNNPACK.

I guess you should implement hipBLAS first, since you will only have to make minimal changes. Just please allow overriding the GPU via AMDGPU_TARGETS, like llama.cpp does (see the build sketch below). I actually have a machine with an AMD GPU that is not officially supported by hipBLAS, and overriding lets me run llama.cpp (with better performance than the CPU), but I can't run anything that doesn't support overriding the GPU, for example the OnnxStream LLM demo with GPU. That is why I was asking for OpenBLAS support.

insanely-fast-whisper is aimed at servers with GPUs; you could aim for the CPU. Also, every Whisper quantization implementation for CPU inference that I have seen so far does only weight quantization. You could do both weight and activation quantization, thereby reducing both disk and memory usage.
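For reference, a rough sketch of how the override works with llama.cpp; flag and binary names have changed across llama.cpp versions, and `gfx1030` / `10.3.0` are just example values for an RDNA2 target:

```sh
# Build llama.cpp for a specific AMD GPU architecture (example target):
cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build

# At runtime, the ROCm runtime can be told to treat an officially
# unsupported GPU as a supported one:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/main -m model.gguf -p "hello"
```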
ok, got it.
I will definitely look at how AMDGPU_TARGETS works in llama.cpp.
Regarding activation quantization, I suspect that it can't be done, but obviously I have to investigate. If the numerical ranges of the activations are too large, W8A8 quantization may produce results that are too imprecise or completely wrong (see the toy example below). This might be why no one has done it already :-)
Vito
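A toy example of the range problem (illustrative only, not OnnxStream code): with symmetric per-tensor int8 quantization, a single activation outlier inflates the scale so much that all the small values collapse to zero on the round trip.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric per-tensor int8 quantization: the scale is set by the
// largest absolute value in the tensor.
int8_t quantize(float x, float scale) {
    int32_t q = static_cast<int32_t>(std::lround(x / scale));
    return static_cast<int8_t>(std::clamp<int32_t>(q, -127, 127));
}

int main() {
    // Typical small activations plus one large outlier, as often
    // happens in transformer activations.
    float acts[] = {0.02f, -0.05f, 0.01f, 0.03f, 12.0f};
    float max_abs = 0.0f;
    for (float a : acts) max_abs = std::max(max_abs, std::fabs(a));
    float scale = max_abs / 127.0f;  // ~0.094: one step dwarfs the small values

    for (float a : acts) {
        float back = quantize(a, scale) * scale;   // quantize, then dequantize
        std::printf("%8.4f -> %8.4f\n", a, back);  // small values round to 0
    }
}
```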
First of all, thanks for this cool project. Now that you have started adding support for models other than Stable Diffusion, please also add support for Whisper with W8A8 quantization.
Also, it seems XNNPACK is for speeding up float operations. Does that mean XNNPACK is not required for W8A8 inference?
Also, consider adding OpenBLAS as a drop-in replacement for cuBLAS, so that accelerated inference is also possible on Intel and AMD CPUs with integrated graphics, where CUDA is not available.
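For what it's worth, the CPU-side mapping this request implies is similarly mechanical: OpenBLAS exposes the standard CBLAS interface, so the same GEMM shape translates to a host call. A sketch only (the wrapper name is hypothetical; note that CBLAS takes an explicit storage order and passes alpha/beta by value, and there is no device memory to manage):

```cpp
#include <cblas.h>  // provided by OpenBLAS

// The same SGEMM as the cuBLAS/hipBLAS call earlier, but on the CPU.
void sgemm_cpu(int m, int n, int k,
               const float* A, int lda,
               const float* B, int ldb,
               float* C, int ldc) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, lda, B, ldb, 0.0f, C, ldc);
}
```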