Keywords: Model Quantization, Efficient Fine-Tuning, Saliency-Guided Rescaling, Large Language Models
Abstract: The prohibitive computational and memory demands of Large Language Models (LLMs) necessitate quantization. However, Post-Training Quantization (PTQ) methods suffer significant performance degradation at low bit-widths (e.g., 4-bit or lower), while Quantization-Aware Training (QAT) is resource-intensive and impractical for billion-scale models. Recent Quantization for Parameter-Efficient Fine-Tuning (Q-PEFT) approaches, which combine scale or adapter tuning with PTQ, offer a compromise but still suffer accuracy collapse and high costs at low bit-widths. We identify a notable phenomenon: for low-bit-width quantization, fine-tuning only a subset of salient parameters (scales) can outperform full-parameter tuning, and the appropriate tuning ratio is related to the inflection point of the Cumulative Distribution Function (CDF) based on the Hessian matrix. Based on this observation, we introduce a simple yet effective method for identifying the salient weights that contribute most to representational fidelity, and accordingly propose a new quantization framework, Saliency-Guided Rescaling (SGR-Q). SGR-Q uses sparsity-Hessian-based identification of salient weight columns and selectively fine-tunes their quantization scales while freezing all other parameters. Fine-tuning only part of the scale parameters retains more of the pretrained model's generalization ability than fine-tuning all of them. Extensive experiments validate the superior performance of SGR-Q compared with PTQ and Q-PEFT methods across benchmarks. For instance, selectively tuning only the top-40\% salient scales achieves 2.5\% higher average accuracy on six commonsense reasoning datasets, with 60\% lower tuning cost, than the state-of-the-art full-scale tuning method PEQA.
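The column-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the saliency score used here (squared column weights multiplied by the diagonal of a Hessian proxy estimated from calibration inputs) is an assumption, as are the function name `select_salient_columns`, the array shapes, and the fixed ratio of 0.4, which stands in for the CDF inflection point mentioned above.

```python
import numpy as np

def select_salient_columns(weight, calib_inputs, ratio=0.4):
    """Return a boolean mask over weight columns marking the top `ratio` by saliency.

    weight:       (out_features, in_features) weight matrix
    calib_inputs: (n_samples, in_features) calibration activations
    The scoring rule is an illustrative assumption, not the paper's exact criterion.
    """
    # Diagonal of the Hessian proxy H = X^T X / n, one entry per input column.
    hess_diag = np.mean(calib_inputs ** 2, axis=0)
    # Column saliency: squared weight mass of the column times its Hessian diagonal.
    saliency = np.sum(weight ** 2, axis=0) * hess_diag
    # Keep the k highest-scoring columns; only their scales would be tuned.
    k = max(1, int(round(ratio * saliency.size)))
    top = np.argsort(saliency)[::-1][:k]
    mask = np.zeros(saliency.size, dtype=bool)
    mask[top] = True
    return mask, saliency

# Toy example with random data (hypothetical shapes).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 10))
X = rng.normal(size=(32, 10))
mask, saliency = select_salient_columns(W, X, ratio=0.4)
```

In a fine-tuning loop, a mask like this would mark which quantization scales receive gradient updates, while the scales of unselected columns and all quantized weights stay frozen.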
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6675