# EfficientQAT
PyTorch implementation of the paper "EfficientQAT: Efficient Quantization-Aware Training for Large Language Models".



## Installation

```
conda create -n efficientqat python==3.11

conda activate efficientqat

pip install -r requirements.txt
```

## Training
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (**Block-AP**) and end-to-end training of quantization parameters (**E2E-QP**). Detailed training scripts can be found in `./examples`. The examples below train Llama-2-7B with w2g64 quantization.
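
For readers unfamiliar with the notation, the sketch below illustrates what a setting such as w2g64 usually means, assuming it denotes 2-bit weights with one scale and zero point per group of 64 weights. It uses plain min/max asymmetric uniform quantization in pure Python. The function names are hypothetical and this is not the quantizer implemented in this repository, which trains its quantization parameters rather than fixing them from min/max statistics.

```python
import math
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 2) -> Tuple[List[int], float, int]:
    """Asymmetric uniform quantization of one group of weights (illustrative)."""
    qmax = 2 ** bits - 1  # largest integer code, 3 for 2-bit weights
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:  # constant group: any positive scale works
        scale = 1.0
    zero_point = round(-w_min / scale)
    # Map each weight to an integer code and clamp it to [0, qmax].
    codes = [min(max(round(w / scale) + zero_point, 0), qmax) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate weights from integer codes."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row: List[float], bits: int = 2, group_size: int = 64):
    """Split a weight row into groups and quantize each group independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


# Toy weight row of 128 values: two groups of 64, each with its own scale.
row = [math.sin(i) for i in range(128)]
groups = quantize_row(row, bits=2, group_size=64)
```

Each group stores its integer codes plus a single scale and zero point, so a smaller group size tracks the weight distribution more closely at the cost of storing more quantization parameters.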

1. Block-AP

Before running the following command, set `--model` in the script to the folder containing the full-precision model.
```
bash examples/block_ap/Llama-2-7b/w2g64.sh
```
In our experiments, `--weight_lr` is `2e-5` for 2-bit and `1e-5` for 3-bit and 4-bit quantization.

Other important arguments:
- `--train_size`: number of training data samples (default: 4096)
- `--val_size`: number of validation data samples (default: 64)
- `--off_load_to_disk`: save the training dataset to disk, which reduces CPU memory usage but may slow down training


2. E2E-QP

Next, load the quantized model produced by Block-AP for further E2E-QP training. E2E-QP can adapt to different scenarios by changing the training dataset. Before running the following commands, set `--quant_model_path` in the script to the folder containing the quantized model.

1\) Train on RedPajama
```
bash examples/e2e_qp/Llama-2-7b/w2g64-redpajama.sh
```

2\) Train on Alpaca
```
bash examples/e2e_qp/Llama-2-7b/w2g64-alpaca.sh
```
In our experiments, `--learning_rate` is `2e-5` for 2-bit and `1e-5` for 3-bit and 4-bit quantization. To reduce the memory footprint during training, you can decrease `--per_device_train_batch_size`; increase `--gradient_accumulation_steps` by the same factor so that the effective batch size stays the same.
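
The batch-size trade-off can be checked with a little arithmetic. The sketch below assumes the usual convention that the effective batch size is the product of the per-device batch size, the number of gradient accumulation steps, and the number of devices. The helper names and the numbers are illustrative only and are not taken from the provided scripts.

```python
from typing import Tuple


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def rescale_for_memory(per_device_batch: int, grad_accum_steps: int, factor: int) -> Tuple[int, int]:
    """Shrink the per-device batch by `factor` and grow accumulation steps to match."""
    if factor < 1 or per_device_batch % factor != 0:
        raise ValueError("factor must be a positive divisor of per_device_batch")
    return per_device_batch // factor, grad_accum_steps * factor


# Illustrative values: batch 16 with 2 accumulation steps, reduced by a factor of 4.
new_batch, new_accum = rescale_for_memory(16, 2, factor=4)
same = effective_batch_size(16, 2) == effective_batch_size(new_batch, new_accum)
```

Here `new_batch` and `new_accum` are the values you would pass to `--per_device_train_batch_size` and `--gradient_accumulation_steps`, and `same` confirms that the effective batch size is unchanged.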





