Simplifying the quantization pipeline #9
This is true. A lot of the code needs refactoring. We need to make it easy to add new models. The good thing is that once we add support for any major model (like GPT-J), it becomes very easy to add support for its derivatives (like GPT-JT). I would greatly welcome suggestions on how we can improve on this.
I'm wondering if it's possible to do the whole quantization process in the Python conversion script. I feel like this would be much simpler than a two-step process with two different programs.
That's a good suggestion, @A2va. Do you have any suggestions on how we can quantize and save in a fast manner in Python?
I don't think we need Python for that, since we already have all the weights saved in the ggml model. We just need the computation graph. ONNX does this by doing a forward pass and saving a static graph. We could potentially do something like that, or perhaps start with ONNX itself.
ONNX is general-purpose, and ggml does not support all of its operations (slicing, for example). If we go that route, we will have to add support for reading the ONNX computation graph, map it to a GGML computation graph, and write graph-specific rewrite rules to substitute for the missing operations. If it is worth it in the long run, we could do it. But it will take a long time to get something tangible, even a minimal viable prototype that converts an ONNX computation graph to GGML for a single model.
Not directly, but I found a script in llama.cpp which takes an already quantized PyTorch model and converts it to a ggml model. Quantization of LLaMA and OPT models in Python: https://github.com/qwopqwop200/GPTQ-for-LLaMa. I have no idea how fast it is, but it's certainly slower than the C++ version.
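For concreteness, the core quantization step being discussed is straightforward to express in Python. Below is a minimal sketch of symmetric per-block quantization, loosely in the spirit of ggml's Q4-style block formats; the function names, the block size of 32, and the [-7, 7] integer range are illustrative assumptions, not the exact ggml on-disk layout.

```python
def quantize_block(xs, levels=7):
    """Symmetrically quantize one block of floats to ints in [-levels, levels].

    Simplified sketch; not ggml's exact Q4_0 bit layout.
    """
    scale = max(abs(v) for v in xs) / levels or 1.0  # guard all-zero blocks
    q = [max(-levels, min(levels, round(v / scale))) for v in xs]
    return q, scale

def dequantize_block(q, scale):
    """Reconstruct approximate floats from quantized ints and the block scale."""
    return [v * scale for v in q]

def quantize(weights, block_size=32):
    """Quantize a flat list of weights block by block."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]
```

The slow part in pure Python is the per-element loop; a real conversion script would vectorize this with NumPy (or keep it in C++), which is presumably the performance concern raised above.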
The quantization pipeline seems very hard to use. Besides manually adding support for popular models, I think it would be a good idea to further automate the quantization pipeline.
As far as I can tell, we only need a dict mapping the layer names in the original model to the layer names in the GPT model, and then the same script could handle any architecture.
Thoughts @tejasvaidhyadev @Ayushk4 ?
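The per-architecture name-mapping idea could be sketched as below. The map contents, the `{i}` layer-index template, and all tensor names here are hypothetical examples, not the actual names used by this repo or by GPT-J checkpoints.

```python
# Hypothetical per-architecture table: original checkpoint tensor names on the
# left, names the converter expects on the right. "{i}" is expanded once per
# transformer block so one entry covers every layer.
GPTJ_NAME_MAP = {
    "transformer.wte.weight": "tok_embeddings.weight",
    "transformer.h.{i}.attn.q_proj.weight": "layers.{i}.attn.wq.weight",
    "transformer.h.{i}.mlp.fc_in.weight": "layers.{i}.ffn.w1.weight",
}

def rename_tensors(tensor_names, name_map, n_layers):
    """Expand the templated map and return {original_name: target_name}."""
    expanded = {}
    for src, dst in name_map.items():
        if "{i}" in src:
            for i in range(n_layers):
                expanded[src.format(i=i)] = dst.format(i=i)
        else:
            expanded[src] = dst
    return {name: expanded[name] for name in tensor_names if name in expanded}
```

With a table like this per architecture, a single generic conversion script could iterate over a checkpoint's tensors, rename them, and write them out, with only the dict differing between models.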