# Adapting to _Any_ Bit-Width: Channel-Wise Mixed-Precision Quantization for LLMs

This repository contains the code for *Adapting to _Any_ Bit-Width: Channel-Wise Mixed-Precision Quantization for LLMs*.

---


## 1. Compute per-layer activation
First, make sure your LLaMA checkpoint in Hugging Face format is saved at `[MODEL_PATH]`. The following command computes the per-layer outputs (activations) of the model and saves them to `[ACT_OUT_PATH]`.

```
python get_layeroutput.py --output_path [ACT_OUT_PATH] --model [MODEL_PATH]
```
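For readers unfamiliar with how per-layer outputs are usually collected, the sketch below uses PyTorch forward hooks on a toy model. It is only an illustration and not the code in `get_layeroutput.py`: the helper name `capture_layer_outputs`, the toy `nn.Sequential` model, and the choice to hook only `nn.Linear` modules are all assumptions made for this example.

```python
import torch
import torch.nn as nn


def capture_layer_outputs(model, inputs, layer_types=(nn.Linear,)):
    """Run one forward pass and return {module_name: output tensor}.

    Hypothetical helper for illustration; hooks every submodule whose
    type is listed in `layer_types`.
    """
    captured = {}
    handles = []

    def make_hook(name):
        # A forward hook receives (module, positional inputs, output).
        def hook(module, args, output):
            captured[name] = output.detach().cpu()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, layer_types):
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always remove hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return captured


# Toy stand-in for an LLM (assumption: a real run would load [MODEL_PATH]).
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
layer_outputs = capture_layer_outputs(toy_model, torch.randn(2, 8))
```

Here `layer_outputs` maps each hooked module name to the tensor it produced, which could then be written to disk with `torch.save`.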

## 2. CMPQ quantization
`CMPQ.py` contains the implementation of CMPQ. To perform channel-wise quantization, run the command below. The `--bit` argument specifies the target bit-width and can be set to any value between 2 and 4, including fractional ones. The `--model_path` and `--layerout` arguments should point to the original model checkpoint and the activations saved in the previous step, respectively. The implementation also retains 0.05% quantization-aware outliers. The `--sensitivity` argument controls the fraction of activation-based outliers. Use `--tasks` to evaluate the quantized model on the `wiki` or `c4` dataset.

```
python CMPQ.py --model_path [MODEL_PATH] --sensitivity 0.45 --model_type llama --device 0 --layerout [ACT_OUT_PATH] --bit 3 --tasks wiki
```
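To give an intuition for how a fractional `--bit` value can arise from integer per-channel precisions, here is a minimal sketch. It is not CMPQ's allocation rule: the function `allocate_channel_bits`, the per-channel `scores`, and the restriction to two adjacent integer bit-widths are assumptions made purely for illustration.

```python
import math


def allocate_channel_bits(scores, target_bits):
    """Assign an integer bit-width to each channel so the mean is close to target_bits.

    Hypothetical scheme: channels with the highest score get floor(target) + 1
    bits, the rest get floor(target) bits.
    """
    n = len(scores)
    low = math.floor(target_bits)
    frac = target_bits - low
    # Number of channels promoted to the higher precision.
    num_high = round(frac * n)
    # Channel indices sorted from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    bits = [low] * n
    for i in order[:num_high]:
        bits[i] = low + 1
    return bits


# Made-up importance scores for 8 channels (illustrative values only).
scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.3, 0.8, 0.05]
bits = allocate_channel_bits(scores, 2.5)
average_bits = sum(bits) / len(bits)
print(bits, average_bits)
```

With these scores, half of the channels receive 3 bits and the other half 2 bits, so the average precision matches the requested 2.5 bits.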
## 3. On-device deployment

The `real_deployment` folder provides the kernel implementation for CMPQ.