Mistral.rs supports enhancing the performance of models quantized with ISQ by collecting an imatrix from calibration data. The following quantizations are supported with an imatrix:
- Q2K
- Q3K
- Q4K
- Q5K
- Q6K
What is an imatrix? An imatrix (importance matrix) is generated from data collected while executing the model on calibration data. This data improves the quality of the quantized model by enabling a weighted RMSE minimization when quantizing each tensor. For more information, see the original PR.
Using an imatrix makes the quantization process take longer because the calibration data must first be collected, but there is no inference-time performance decrease.
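To illustrate the idea behind the weighted RMSE minimization, here is a minimal, self-contained sketch (not mistral.rs's actual quantization code). The importance values stand in for imatrix statistics collected during calibration; the quantizer picks the scale that minimizes the importance-weighted squared error rather than the plain squared error:

```python
import random

random.seed(0)
w = [random.gauss(0, 1) for _ in range(16)]           # one row of weights
imp = [random.uniform(0.1, 10.0) for _ in range(16)]  # stand-in importance stats

def quantize(x, scale):
    # Round to a 4-bit-style integer grid at the given scale.
    q = max(-8, min(7, round(x / scale)))
    return q * scale

def weighted_err(w, weights, scale):
    # Importance-weighted squared quantization error.
    return sum(wt * (x - quantize(x, scale)) ** 2 for x, wt in zip(w, weights))

def best_scale(w, weights):
    # Grid-search the scale minimizing the weighted error.
    amax = max(abs(x) for x in w)
    scales = [amax / n for n in range(1, 64)]
    return min(scales, key=lambda s: weighted_err(w, weights, s))

s_plain = best_scale(w, [1.0] * len(w))  # unweighted RMSE choice
s_imat = best_scale(w, imp)              # imatrix-weighted choice
err_plain = weighted_err(w, imp, s_plain)
err_imat = weighted_err(w, imp, s_imat)
```

By construction, the weighted choice trades accuracy on unimportant entries for accuracy on entries that matter most at inference time, so `err_imat` is never worse than `err_plain` under the importance weighting.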
To use this, simply specify the calibration data file in the various APIs as detailed below.
```
./mistralrs-server -i --isq Q4K plain -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt
```
You can find this example here.
```rust
let model = TextModelBuilder::new("meta-llama/Llama-3.2-3B-Instruct")
    .with_isq(IsqType::Q4K)
    .with_calibration_file("calibration_data/calibration_datav3_small.txt".into())
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;
```
You can find this example here.
```python
runner = Runner(
    which=Which.Plain(
        model_id="meta-llama/Llama-3.2-3B-Instruct",
        calibration_file="calibration_data/calibration_datav3_small.txt",
    ),
    in_situ_quant="Q4K",
)
```