Keywords: Quantization; Large Language Model
Abstract: Rotation-based methods have become essential for state-of-the-art LLM quantization by effectively mitigating outliers in weights and activations. Current approaches predominantly focus on optimizing the global rotation matrix to achieve marginal accuracy improvements, a strategy that incurs prohibitive computational costs through full-model backpropagation while offering limited practical utility.
We fundamentally reassess this optimization paradigm and identify two critical error sources that persist even under optimal rotation conditions: (i) channel mean misalignment, which amplifies rounding errors during quantization, and (ii) clipping-induced energy loss, which is exacerbated by the rotation-induced Gaussian-like distributions. Our analysis reveals that directly addressing these issues offers a more effective path to achieving high quantization accuracy.
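The two error sources can be illustrated with a toy numerical sketch. The code below is not from the paper; it uses a generic symmetric uniform quantizer and synthetic Gaussian data (all names and parameters are illustrative assumptions) to show that (i) a nonzero channel mean inflates the quantization step and hence the rounding error, and (ii) clipping a Gaussian-like tensor discards a measurable fraction of its energy.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, n_bits=4, clip_ratio=1.0):
    # generic symmetric uniform quantizer with optional range clipping
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip_ratio * np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

x = rng.normal(size=4096)

# (i) channel mean misalignment: a nonzero mean widens the dynamic
# range, so the same bit budget yields a coarser step and larger error
x_shifted = x + 3.0
err_centered = np.mean((quantize(x) - x) ** 2)
err_shifted = np.mean((quantize(x_shifted) - x_shifted) ** 2)
assert err_shifted > err_centered

# (ii) clipping-induced energy loss: clipping the Gaussian-like tensor
# at half its range removes part of the signal energy carried by the tails
x_clipped = quantize(x, clip_ratio=0.5)
energy_loss = 1.0 - np.sum(x_clipped ** 2) / np.sum(x ** 2)
assert energy_loss > 0.0
```

Rotation makes activations more Gaussian-like, which is exactly the regime where aggressive clipping trades tail energy for a finer step, motivating an explicit compensation term.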
Based on these insights, we introduce \textbf{BASE-Q}, a lightweight quantization framework that circumvents expensive global rotation learning. \textbf{BASE-Q} employs a simple yet powerful transformer-block-wise correction strategy: \textbf{bias correction} to eliminate channel mean misalignment and \textbf{asymmetric scaling} to compensate for clipping-induced energy loss. This blockwise strategy drastically reduces optimization overhead, enabling efficient quantization of 70B-parameter models on a single GPU.
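A minimal sketch of the bias-correction idea, under stated assumptions: the abstract does not give BASE-Q's exact formulation, so the code below only illustrates the principle with a toy per-channel mean subtraction (the data, quantizer, and channel count are all hypothetical). Subtracting each channel's mean before quantization and folding it back afterwards (where it can be absorbed into a bias term) reduces the rounding error on mean-misaligned channels.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, n_bits=4):
    # generic symmetric uniform quantizer (illustrative, not BASE-Q's)
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

# toy activations: 8 channels with misaligned (nonzero) per-channel means
n_ch = 8
X = rng.normal(size=(4096, n_ch)) + rng.normal(size=n_ch) * 3.0

# naive per-channel quantization
X_q_naive = np.stack([quantize(X[:, c]) for c in range(n_ch)], axis=1)

# bias correction: center each channel, quantize, then add the mean back
mu = X.mean(axis=0)
X_q_bias = np.stack(
    [quantize(X[:, c] - mu[c]) for c in range(n_ch)], axis=1
) + mu

err_naive = np.mean((X_q_naive - X) ** 2)
err_bias = np.mean((X_q_bias - X) ** 2)
assert err_bias < err_naive
```

Because the correction operates on per-channel statistics within a block, it needs no full-model backpropagation, which is what keeps the optimization overhead low enough for single-GPU use on large models.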
Extensive experiments across diverse LLMs and benchmarks validate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5\%, 42.9\%, and 29.2\% compared to the previous rotation-based methods QuaRot, SpinQuant, and OSTQuant, respectively, demonstrating the superiority of our lightweight paradigm.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 12138