Keywords: Post-training quantization, efficient inference, model compression
Abstract: Post-training weight quantization is an effective method for reducing the serving cost of large language models.
However, standard round-to-nearest quantization often introduces large errors due to outliers in the weights.
Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations, or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting due to data unavailability or additional computational overhead. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to *any* calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss that requires no calibration data. To maintain inference efficiency, we apply structured matrix transformations to single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma and Qwen models. Compared to baseline methods without calibration data, we observe consistent improvement across various benchmarks and quantization levels. Compared to SOTA calibration-based methods, for 4-bit quantization on the Qwen2.5-7B model, our method achieves comparable performance, with a $<0.2$ reduction in WikiText-2 and C4 perplexities and an even better average score on zero-shot benchmarks, demonstrating the effectiveness of calibration-free quantization.
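To illustrate the outlier problem the abstract describes, the following is a minimal sketch, not the method proposed in the paper. It assumes symmetric per-row 4-bit round-to-nearest quantization and a dense random orthogonal rotation (one of the mitigation mechanisms the abstract lists), whereas the paper optimizes structured transformations against a proxy loss. The function names `rtn_quantize` and `random_orthogonal`, the matrix size, and the outlier magnitude are all hypothetical choices made for illustration.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric per-row round-to-nearest: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    # One scale per row, set by that row's largest magnitude.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix keeps the result orthogonal

rng = np.random.default_rng(0)
n = 256
w = rng.standard_normal((n, n))
w[:, 0] = 20.0  # assumed toy setup: one large outlier in every row

# Plain round-to-nearest: the outlier inflates each row's scale.
err_rtn = np.linalg.norm(w - rtn_quantize(w))

# Rotate, quantize, rotate back. Because R is orthogonal, the error
# measured in the original space equals the error in the rotated space.
R = random_orthogonal(n, rng)
w_rot_q = rtn_quantize(w @ R)
err_rot = np.linalg.norm(w - w_rot_q @ R.T)

print(f"RTN error: {err_rtn:.2f}, rotated RTN error: {err_rot:.2f}")
```

In this toy setup the rotation spreads each outlier's energy across the whole row, so the per-row scale shrinks and the rounding error drops. At inference time the rotation has to be undone or folded into adjacent computation, which is why the abstract stresses structured transformations for efficiency.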
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 166