Keywords: Adaptive Rounding, Large Language Model, Quantization
Abstract: Large language model (LLM) quantization predominantly relies on round-to-nearest (RTN) as the atomic operation for mapping floating-point (FP) weights onto quantization grids. Applied at tensor-, group-, or channel-level granularity, such non-element-wise rounding is sub-optimal because it prevents error cancellation across elements. Adaptive rounding addresses this by assigning each weight an optimized rounding parameter, but existing methods introduce an auxiliary matrix the same size as the weights, substantially inflating computation and memory costs. We therefore propose VQRound, which re-parameterizes the rounding matrix via vector quantization (VQ) into a compact codebook, drastically reducing the number of trainable variables while preserving quantization fidelity. We identify the critical role of the rounding matrix's initialization: a proper scheme minimizes deviation from the FP model and enables efficient tuning of the rounding parameters. Beyond naive layer- or block-wise optimization, we introduce a lightweight end-to-end finetuning pipeline that requires only 128 samples and enables global optimization of the codebooks across all layers. Moreover, VQRound serves as a plug-and-play replacement for atomic rounding, complementing existing quantization techniques to further improve accuracy. Experiments on billion-parameter models, including OPT, LLaMA, and Qwen, show that VQRound achieves competitive performance under 4-bit, 3-bit, and even 2-bit quantization with as few as 0.2% of the learnable parameters of prior adaptive rounding methods.
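To make the re-parameterization idea concrete, below is a minimal PyTorch sketch (not the authors' code) of an adaptive-rounding linear layer whose per-weight rounding offsets are drawn from a small trainable VQ codebook via fixed assignment indices. Names such as `VQRoundLinear`, `codebook_size`, and `vec_dim` are illustrative assumptions, and details like zero-points and the assignment procedure (e.g., k-means on an initial rounding matrix) are omitted or simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQRoundLinear(nn.Module):
    """Sketch: adaptive rounding re-parameterized by a compact VQ codebook."""
    def __init__(self, weight, scale, codebook_size=256, vec_dim=8, n_bits=4):
        super().__init__()
        # Frozen FP weight and quantization scale (assumed broadcastable to weight).
        self.register_buffer("weight", weight)
        self.register_buffer("scale", scale)
        self.qmin, self.qmax = 0, 2 ** n_bits - 1
        # Assumes weight.numel() is divisible by vec_dim.
        n_vecs = weight.numel() // vec_dim
        # The codebook of rounding vectors is the ONLY learnable parameter.
        self.codebook = nn.Parameter(torch.zeros(codebook_size, vec_dim))
        # Fixed assignment of each weight vector to a codebook entry
        # (in practice initialized from an initial rounding matrix, e.g. by k-means;
        # random here purely for illustration).
        self.register_buffer("assign", torch.randint(codebook_size, (n_vecs,)))

    def quantize(self):
        w = self.weight / self.scale
        # Gather per-weight rounding offsets in [0, 1] from the codebook.
        r = torch.sigmoid(self.codebook[self.assign]).reshape_as(self.weight)
        # Straight-through estimator: hard 0/1 rounding forward, soft gradient backward.
        r_hard = (r >= 0.5).float()
        r_ste = r + (r_hard - r).detach()
        q = torch.clamp(torch.floor(w) + r_ste, self.qmin, self.qmax)
        return q * self.scale  # dequantized weight

    def forward(self, x):
        return F.linear(x, self.quantize())
```

Only the codebook (codebook_size x vec_dim entries) receives gradients, so the trainable-parameter count is decoupled from the weight matrix size, which is the core of the claimed reduction relative to a full-size auxiliary rounding matrix.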
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 5996