Track: tiny / short paper (up to 4 pages)
Keywords: Large Language Models; Quantization; Machine Learning; Natural Language Processing; Fine-tuning; Model Compression
TL;DR: We show that the difficulty of post-training quantization arises from a stark misalignment between optimization of the local and global objective functions, and we investigate this misalignment from a loss-landscape perspective.
Abstract: Large language models with high parameter counts are computationally expensive, yet they can be made much more efficient by compressing their weights to very low numerical precision. This can be achieved either through post-training quantization, which minimizes local, layer-wise quantization errors, or through quantization-aware fine-tuning, which minimizes the global loss function. In this study, we find that, under the same data constraint, the former approach almost always fares worse than the latter, a gap that is particularly prominent at very low numerical precision. We further show that this difficulty of post-training quantization arises from a stark misalignment between optimization of the local and global objective functions. Our findings suggest limited utility in minimizing local quantization error, and underscore the importance of direct quantization-aware fine-tuning, in the regime of large models at very low precision.
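The distinction between the two objectives named in the abstract can be sketched in a toy setting. The two-layer linear network, the uniform round-to-nearest quantizer, and the random data below are illustrative assumptions for exposition only, not the paper's actual models or method: the sketch merely contrasts the local objective (per-layer weight quantization error) with the global objective (end-to-end output error of the quantized model).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear network (an illustrative assumption): y = W2 @ (W1 @ x)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(1, 8))
X = rng.normal(size=(8, 64))
y = W2 @ (W1 @ X)

def quantize(w, bits):
    # Uniform symmetric round-to-nearest quantization of a weight matrix.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

for bits in (8, 4, 2):
    Q1, Q2 = quantize(W1, bits), quantize(W2, bits)
    # Local objective: sum of per-layer weight quantization errors.
    local_err = np.mean((Q1 - W1) ** 2) + np.mean((Q2 - W2) ** 2)
    # Global objective: end-to-end output error of the quantized network.
    global_err = np.mean((Q2 @ (Q1 @ X) - y) ** 2)
    print(f"{bits}-bit  local MSE {local_err:.4f}  global MSE {global_err:.4f}")
```

In this sketch, the two quantities need not track one another: small per-layer weight error does not guarantee small end-to-end error, which is the kind of local/global mismatch the abstract refers to.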
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 19