WiniQ: Accelerating Quantization-Aware Training of LLMs around Saddle Points

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Language model quantization, low-precision quantized training, Hessian
Abstract: Quantization-aware training is a widely used approach for language model quantization at sub-4-bit precision. It trains full-precision weights to minimize the loss, using gradients computed on the quantized model. Despite its superior performance, the main bottleneck of this quantized training is its slow convergence, which worsens at lower bit-widths. While this problem has been observed in prior work, its precise cause has not been carefully studied. In this paper, we analyze the convergence behavior by computing the Hessian spectrum of the model loss throughout quantization-aware training. We find the key reason is that the model weights converge to flat regions near saddle points, where a large fraction of Hessian eigenvalues concentrate around zero and the magnitude of both positive and negative eigenvalues decreases over training. Convergence is also slower at lower bit-widths, where the loss Hessian eigenvalues have significantly smaller magnitude. Motivated by these findings, we propose WiniQ, an approach that accelerates quantized training with minimal overhead. The key technique in WiniQ is periodic weight re-initialization by linear interpolation between the full-precision and quantized weights. This interpolation resets the weights to regions with larger-magnitude Hessian eigenvalues without increasing the loss. We further use noise injection to regularize the Hessian, yielding an algorithm that is broadly applicable across quantization methods. Extensive experiments show that WiniQ accelerates various quantized training methods by up to **4**$\times$. Under the same training budget as prior training methods, WiniQ improves state-of-the-art sub-4-bit quantization performance by up to **8.8**% in relative terms. Additionally, WiniQ remains consistently effective across 16 settings of different language models, quantization methods, and bit-widths.
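A minimal sketch of the interpolation-based re-initialization described in the abstract, assuming a PyTorch setting; the `quantize` function, the `alpha` coefficient, and the in-place update are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def reinit_by_interpolation(w_fp: torch.Tensor, quantize, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch: periodically reset the latent full-precision weights
    by linearly interpolating them toward their quantized counterparts, as the
    abstract describes. `quantize` is any weight quantizer (e.g., a sub-4-bit
    round-to-nearest function); `alpha` controls how far the weights move."""
    with torch.no_grad():
        w_q = quantize(w_fp)                              # current quantized weights
        w_fp.copy_((1.0 - alpha) * w_fp + alpha * w_q)    # interpolate in place
    return w_fp
```

In this reading, the update leaves the quantized model (and hence the loss) unchanged when the quantizer maps the interpolated weights to the same codes, while moving the latent weights out of the flat region around the saddle point.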
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6844