Simplifying the quantization pipeline #9
This is true. A lot of the code needs refactoring. We need to make it easy to add new models. The good thing is that once we add support for any major model (like GPT-J), it becomes very easy to add support for its derivatives (like GPT-JT). I would greatly welcome suggestions on how we can improve on this.
I'm wondering if it's possible to do the whole quantization process in the Python conversion script. I feel like this would be much simpler than a two-step process with two different programs.
That's a good suggestion, @A2va. Do you have any suggestions on how we can quantize and save in a fast manner in Python?
I don't think we need Python for that, since we already have all the weights saved in the ggml model. We just need the computation graph. ONNX does this by doing a forward pass and saving a static graph. We could potentially do something like that, or perhaps start with ONNX itself.
ONNX is general-purpose, and ggml does not support all of its operations (slicing, for example). If we go that route, we will have to add support for reading the ONNX computation graph, map it to a GGML computation graph, and write graph-specific rewrite rules to substitute for the missing operations. If it is worth it in the long run, we could do it. But it will take a long time to get something tangible, even a minimal viable prototype that converts an ONNX computation graph to GGML for a single model.
Not directly, but I found a script in llama.cpp which takes an already quantized PyTorch model and converts it to a ggml model. Quantization of LLaMA and OPT models in Python: https://github.com/qwopqwop200/GPTQ-for-LLaMa. I have no idea how fast it is, but it's certainly slower than the C++ version.
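For concreteness, the core quantization step being discussed is straightforward to express in Python. Below is a minimal sketch of symmetric per-block quantization, loosely in the spirit of ggml's Q4-style block formats; the function names, the block size of 32, and the [-7, 7] integer range are illustrative assumptions, not the exact ggml on-disk layout.

```python
def quantize_block(xs, levels=7):
    """Symmetrically quantize one block of floats to ints in [-levels, levels].

    Simplified sketch; not ggml's exact Q4_0 bit layout.
    """
    scale = max(abs(v) for v in xs) / levels or 1.0  # guard all-zero blocks
    q = [max(-levels, min(levels, round(v / scale))) for v in xs]
    return q, scale

def dequantize_block(q, scale):
    """Reconstruct approximate floats from quantized ints and the block scale."""
    return [v * scale for v in q]

def quantize(weights, block_size=32):
    """Quantize a flat list of weights block by block."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]
```

The slow part in pure Python is the per-element loop; a real conversion script would vectorize this with NumPy (or keep it in C++), which is presumably the performance concern raised above.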
The quantization pipeline seems very hard to use. Besides manually adding support for popular models, I think it would be a good idea to further automate the quantization pipeline.
As far as I can tell, we only need a dict mapping the layer names in the original model to the layer names in the GPT model, and then the same script could handle any architecture.
Thoughts @tejasvaidhyadev @Ayushk4 ?
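The per-architecture name-mapping idea could be sketched as below. The map contents, the `{i}` layer-index template, and all tensor names here are hypothetical examples, not the actual names used by this repo or by GPT-J checkpoints.

```python
# Hypothetical per-architecture table: original checkpoint tensor names on the
# left, names the converter expects on the right. "{i}" is expanded once per
# transformer block so one entry covers every layer.
GPTJ_NAME_MAP = {
    "transformer.wte.weight": "tok_embeddings.weight",
    "transformer.h.{i}.attn.q_proj.weight": "layers.{i}.attn.wq.weight",
    "transformer.h.{i}.mlp.fc_in.weight": "layers.{i}.ffn.w1.weight",
}

def rename_tensors(tensor_names, name_map, n_layers):
    """Expand the templated map and return {original_name: target_name}."""
    expanded = {}
    for src, dst in name_map.items():
        if "{i}" in src:
            for i in range(n_layers):
                expanded[src.format(i=i)] = dst.format(i=i)
        else:
            expanded[src] = dst
    return {name: expanded[name] for name in tensor_names if name in expanded}
```

With a table like this per architecture, a single generic conversion script could iterate over a checkpoint's tensors, rename them, and write them out, with only the dict differing between models.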