Abstract: Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has recently been studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines first fine-tune the pre-trained model and then apply post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least-squares quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia, Qwen, and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and LLM architectures. Our code is available at https://github.com/OptimAI-Lab/RoSTE.
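To make the two core ingredients of the abstract concrete, below is a minimal sketch (not the authors' implementation) of quantization-aware training with a straight-through estimator applied to rotated weights and activations. The bit-width, the use of a fixed random orthogonal rotation, and the `RotatedQuantLinear` layer are illustrative assumptions; RoSTE's adaptive rotation selection and KV-cache quantization are not reproduced here.

```python
# Sketch: STE-based fake quantization on rotated weights/activations (PyTorch).
import torch
import torch.nn as nn


def fake_quantize_ste(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through gradient."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # STE: forward uses the quantized value, backward passes gradients through x.
    return x + (x_q - x).detach()


class RotatedQuantLinear(nn.Module):
    """Linear layer that rotates weights and activations before fake quantization."""

    def __init__(self, in_features: int, out_features: int, n_bits: int = 4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.n_bits = n_bits
        # Fixed random orthogonal rotation as a stand-in for an optimized one.
        q, _ = torch.linalg.qr(torch.randn(in_features, in_features))
        self.register_buffer("rotation", q)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rotating activations and weights by the same orthogonal matrix leaves
        # the full-precision product unchanged while spreading out activation
        # outliers, which reduces the quantization error of both operands.
        x_rot = fake_quantize_ste(x @ self.rotation, self.n_bits)
        w_rot = fake_quantize_ste(self.linear.weight @ self.rotation, self.n_bits)
        return nn.functional.linear(x_rot, w_rot)


# Usage: a drop-in replacement for nn.Linear during quantization-aware fine-tuning.
layer = RotatedQuantLinear(in_features=64, out_features=32, n_bits=4)
out = layer(torch.randn(8, 64))
out.sum().backward()  # gradients reach the full-precision weights via the STE
```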
Lay Summary: Large language models like ChatGPT are powerful but require a lot of memory and computing power, making them difficult to use on smaller devices. One way to make these models more efficient is through quantization, which compresses the model by using fewer bits to store numbers. However, quantizing a model after it’s already trained often hurts its performance.
Our research introduces a smarter way: we fine-tune and compress the model at the same time using a new method called RoSTE. RoSTE learns how to rotate and adjust the model’s data in a way that reduces errors caused by compression. This makes the model smaller and faster while keeping its performance high.
We tested RoSTE on popular models and tasks and found that it consistently worked better than existing compression methods. This approach helps make AI models more accessible and usable on a wider range of devices.
Link To Code: https://github.com/OptimAI-Lab/RoSTE
Primary Area: Deep Learning->Large Language Models
Keywords: model quantization, quantization-aware training, fine-tuning, large language models
Submission Number: 15164