[FEAT] Add support for optimum-quanto #2000

Open · wants to merge 20 commits into base: main
21 changes: 21 additions & 0 deletions docs/source/developer_guides/quantization.md
@@ -211,6 +211,27 @@ model = get_peft_model(base_model, peft_config)
- DoRA only works with `quant_type = "int8_weight_only"` at the moment.
- There is explicit support for torchao when used with LoRA. However, when torchao quantizes a layer, its class does not change, only the type of the underlying tensor. For this reason, PEFT methods other than LoRA will generally also work with torchao, even if not explicitly supported. Be aware, however, that **merging only works correctly with LoRA and with `quant_type = "int8_weight_only"`**. If you use a different PEFT method or dtype, merging will likely result in an error, and even if it doesn't, the results will be incorrect.

## Optimum-quanto

PEFT supports models quantized with [optimum-quanto](https://github.com/huggingface/optimum-quanto). This has been tested with 2-bit, 4-bit, and 8-bit int quantization. Optimum-quanto also works on CPU and MPS.

```python
from transformers import AutoModelForCausalLM, QuantoConfig

from peft import LoraConfig, get_peft_model

model_id = ...
quantization_config = QuantoConfig(weights="int4")  # or "int2" or "int8"
base_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
peft_config = LoraConfig(...)
model = get_peft_model(base_model, peft_config)
```

### Caveats:

- Use optimum-quanto v0.2.5 or later; otherwise, saving and loading won't work properly.
- If you want to use optimum-quanto via transformers, install transformers v4.46.0 or above.
- Float8 is discouraged as it can easily produce NaNs.
- There is explicit support for optimum-quanto when used with LoRA. However, when optimum-quanto quantizes a layer, it remains a subclass of the corresponding torch class (e.g., quanto's `QLinear` is a subclass of `nn.Linear`). For this reason, non-LoRA methods will generally also work with optimum-quanto, even if not explicitly supported. Be aware, however, that **merging only works correctly with LoRA** (see the sketch below). If you use a method other than LoRA, merging may not raise an error, but the results will be incorrect.
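As a minimal sketch of the save/load and merge workflow, assuming the quanto-quantized LoRA model built in the snippet above (the adapter path is hypothetical):

```python
from peft import PeftModel

# Save the trained LoRA adapter; optimum-quanto >= 0.2.5 is needed for this
# to round-trip correctly (see the first caveat above).
model.save_pretrained("my-quanto-lora-adapter")

# Reload the adapter onto a freshly quantized base model.
model = PeftModel.from_pretrained(base_model, "my-quanto-lora-adapter")

# Merge the LoRA weights into the quantized base model. Per the caveat above,
# this is only expected to produce correct results with LoRA.
merged_model = model.merge_and_unload()
```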

## Other Supported PEFT Methods

Besides LoRA, the following PEFT methods also support quantization:
1 change: 1 addition & 0 deletions setup.py
@@ -38,6 +38,7 @@
"scipy",
"protobuf",
"sentencepiece",
"optimum-quanto",
]

setup(
7 changes: 7 additions & 0 deletions src/peft/import_utils.py
@@ -103,3 +103,10 @@ def is_torchao_available():
f"but only versions above {TORCHAO_MINIMUM_VERSION} are supported"
)
return True


@lru_cache
def is_quanto_available():
return (importlib.util.find_spec("optimum") is not None) and (
importlib.util.find_spec("optimum.quanto") is not None
)
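For context, a small usage sketch of this helper, mirroring how the other `is_*_available` checks gate optional imports (the call site shown is an assumption, not part of this diff):

```python
from peft.import_utils import is_quanto_available

if is_quanto_available():
    # Only import optimum-quanto symbols once we know the package is installed.
    from optimum.quanto import QLinear
```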
2 changes: 2 additions & 0 deletions src/peft/tuners/lora/model.py
@@ -53,6 +53,7 @@
from .gptq import dispatch_gptq
from .hqq import dispatch_hqq
from .layer import Conv2d, LoraLayer, dispatch_default
from .quanto import dispatch_quanto
from .torchao import dispatch_torchao
from .tp_layer import dispatch_megatron

@@ -343,6 +344,7 @@ def dynamic_dispatch_func(target, adapter_name, lora_config, **kwargs):
dispatch_gptq,
dispatch_hqq,
dispatch_torchao,
dispatch_quanto,
dispatch_megatron,
dispatch_default,
]
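For illustration, a hedged sketch of the dispatcher contract that `dispatch_quanto` slots into: each dispatcher inspects the target module and returns a LoRA-wrapped replacement, or `None` so the next dispatcher in the list is tried. The class name `QuantoLoraLinear` is an assumption; the actual implementation lives in `src/peft/tuners/lora/quanto.py`:

```python
def dispatch_quanto(target, adapter_name, lora_config, **kwargs):
    # Sketch only: wrap quanto layers with a LoRA layer, otherwise defer
    # to the next dispatcher by returning None.
    new_module = None
    if is_quanto_available():
        from optimum.quanto import QLinear

        if isinstance(target, QLinear):
            new_module = QuantoLoraLinear(target, adapter_name, **kwargs)  # hypothetical class
    return new_module
```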