Keywords: Efficient Fine-Tuning, NLP, Iterative Optimization, Layer-wise Quantization and Low-Rank Configuration
TL;DR: A method for obtaining a high-performance, low-precision fine-tuned model.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) and quantization have emerged to address the memory consumption and computational efficiency issues of fine-tuning large language models (LLMs). Recent studies combine the two and adjust model parameters before fine-tuning to reduce quantization error, aiming to improve fine-tuning performance. We find that fine-tuning these adjusted quantized models performs even worse than fine-tuning the original quantized models directly, because the adjusted model is essentially a different model from the original quantized one. We also observe that, due to the poor robustness of quantized models, increasing the training difficulty can lead to even worse outcomes. To address this, we propose two constraints for fine-tuning quantized models and, based on them, introduce a general fine-tuning framework called QR-Adaptor. This framework bypasses the network errors introduced by quantization and instead uses actual task performance and memory as the optimization targets. Through initialization, extrapolation, and interpolation, it quickly solves this gradient-free optimization problem. Experimental results demonstrate that our method yields fine-tuned low-bit quantized models that outperform fine-tuned 16-bit models while maintaining the same memory footprint as fine-tuning 4-bit models. For example, in the zero-shot MMLU evaluation, it improves accuracy by 3.3% over both LoftQ and LQ-LoRA.
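As a rough illustration only (not the paper's QR-Adaptor algorithm), the sketch below shows what a gradient-free search over per-layer quantization bit-widths and LoRA ranks, driven by measured accuracy under a memory budget, might look like. The candidate sets, the memory model, and `evaluate_accuracy` are all hypothetical placeholders standing in for a real quantize-fine-tune-evaluate pipeline.

```python
"""Minimal sketch of a gradient-free layer-wise (bit-width, LoRA-rank) search.

This is not the authors' implementation; it only illustrates treating per-layer
quantization bits and LoRA ranks as a discrete configuration optimized directly
against measured accuracy and memory, rather than against quantization error.
"""
import random

NUM_LAYERS = 8
BIT_CHOICES = [2, 3, 4, 8]        # candidate per-layer bit-widths (assumed)
RANK_CHOICES = [4, 8, 16, 32]     # candidate per-layer LoRA ranks (assumed)
MEMORY_BUDGET = NUM_LAYERS * 5.0  # toy budget in arbitrary units


def memory_cost(config):
    """Toy memory model: bit-width dominates, rank adds a small overhead."""
    return sum(bits + 0.05 * rank for bits, rank in config)


def evaluate_accuracy(config):
    """Hypothetical proxy for fine-tuning the configured model and scoring it.

    A real pipeline would quantize each layer to its bit-width, attach LoRA
    adapters of the chosen ranks, fine-tune, and evaluate on a benchmark such
    as MMLU. Here we fake a smooth objective so the sketch runs end to end.
    """
    return sum(bits * 0.1 + rank * 0.01 for bits, rank in config) / NUM_LAYERS


def feasible(config):
    return memory_cost(config) <= MEMORY_BUDGET


def random_feasible_config():
    while True:
        config = [(random.choice(BIT_CHOICES), random.choice(RANK_CHOICES))
                  for _ in range(NUM_LAYERS)]
        if feasible(config):
            return config


def search(num_init=20, num_steps=50, seed=0):
    """Initialization plus local refinement; a simple stand-in for a staged
    gradient-free optimizer over the discrete configuration space."""
    random.seed(seed)
    # Initialization: sample feasible configurations and keep the best one.
    best = max((random_feasible_config() for _ in range(num_init)),
               key=evaluate_accuracy)
    best_acc = evaluate_accuracy(best)
    # Refinement: perturb one layer at a time, accept feasible improvements.
    for _ in range(num_steps):
        trial = list(best)
        layer = random.randrange(NUM_LAYERS)
        trial[layer] = (random.choice(BIT_CHOICES), random.choice(RANK_CHOICES))
        if feasible(trial) and evaluate_accuracy(trial) > best_acc:
            best, best_acc = trial, evaluate_accuracy(trial)
    return best, best_acc


if __name__ == "__main__":
    config, acc = search()
    print("per-layer (bits, rank):", config)
    print("proxy accuracy: %.3f, memory: %.1f" % (acc, memory_cost(config)))
```

The sketch uses a single accuracy objective under a hard memory budget; the paper's framework treats performance and memory jointly, so the objective and the initialization/extrapolation/interpolation stages would differ in a faithful implementation.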
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 726