Keywords: language model, model compression, computational efficiency, QAT, quantization-aware training
TL;DR: Progressive quantization-aware training (QAT) with outlier channel splitting for low-bit LLMs, yielding a single once-for-any-precision model.
Abstract: Training large language models (LLMs) at ultra-low precision remains challenging: direct low-bit quantization-aware training (QAT) often suffers from slow convergence that demands substantial training budgets, as well as quantization errors arising from heavy-tailed outlier channels and the accumulation of errors across layers. To address these issues, we present Bit-by-Bit, a progressive QAT framework with outlier channel splitting. Our approach integrates three key components: (1) block-wise progressive training that reduces precision stage by stage, ensuring stable initialization for low-bit optimization; (2) rounding-aware outlier channel splitting, which mitigates quantization error while acting as an identity transform that preserves the quantized outputs; and (3) microscaling groups with E4M3 scales to capture dynamic activation ranges, aligned with OCP/NVIDIA practices. Furthermore, we exploit the nested structure of integer quantization grids to enable a single-run, once-for-any-precision model that can be directly deployed at multiple bit-widths without retraining.
We conduct comprehensive evaluations under both weight-only and weight-activation quantization settings. Under W2A2 quantization, Bit-by-Bit narrows the perplexity gap with full-precision models on WikiText2 to just 2.25, outperforming BitDistiller by 24.19 and EfficientQAT by 20.59 perplexity points on Llama2-7B. Moreover, on the Llama3 family, which is known for its quantization difficulty, Bit-by-Bit consistently surpasses other QAT baselines.
Code is available in the Appendix.
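To make the outlier-channel-splitting idea from the abstract concrete, here is a minimal NumPy sketch of the identity-transform property: an outlier input channel of a linear layer is split into two half-magnitude copies and the matching input feature is duplicated, so the layer output is unchanged while the channel's range shrinks. The helper name split_outlier_channel and the plain (non-rounding-aware) splitting rule are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def split_outlier_channel(W, x, j):
    # Append a half-magnitude copy of input channel j of W and halve the original
    # column, then duplicate the matching input feature, so y = x @ W.T is unchanged.
    W_split = np.concatenate([W, W[:, j:j + 1] / 2.0], axis=1)
    W_split[:, j] /= 2.0
    x_split = np.concatenate([x, x[:, j:j + 1]], axis=1)
    return W_split, x_split

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W[:, 3] *= 25.0                      # make channel 3 a heavy-tailed outlier
x = rng.normal(size=(4, 16))

W_s, x_s = split_outlier_channel(W, x, j=3)
assert np.allclose(x @ W.T, x_s @ W_s.T)     # exact identity in full precision
print(np.abs(W).max(), np.abs(W_s).max())    # the weight range shrinks after the split

Halving the outlier channel lets a uniform quantizer use a finer step for the remaining values; the paper's rounding-aware variant additionally accounts for quantization rounding so that the quantized, not just full-precision, outputs are preserved.

The once-for-any-precision claim rests on nested integer grids: with a shared scale, every symmetric low-bit code times a power-of-two shift is also a valid high-bit code, so lower-precision weights can be derived from a single high-precision checkpoint by dropping least-significant bits. The sketch below illustrates that nesting under assumed symmetric-grid conventions; quantize_symmetric and drop_to_lower_bits are hypothetical helpers, not the released implementation.

import numpy as np

def quantize_symmetric(w, bits):
    # Uniform symmetric quantization: signed codes in [-(2^(b-1)-1), 2^(b-1)-1].
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return codes, scale

def drop_to_lower_bits(codes, scale, from_bits, to_bits):
    # Round away the extra LSBs; the coarser grid (step = scale * 2^(from-to))
    # is a subset of the original high-precision grid.
    shift = 2 ** (from_bits - to_bits)
    qmax = 2 ** (to_bits - 1) - 1
    low = np.clip(np.round(codes / shift), -qmax, qmax).astype(np.int32)
    return low, scale * shift

rng = np.random.default_rng(1)
w = rng.normal(size=1024)
c8, s8 = quantize_symmetric(w, bits=8)
c4, s4 = drop_to_lower_bits(c8, s8, from_bits=8, to_bits=4)
c2, s2 = drop_to_lower_bits(c8, s8, from_bits=8, to_bits=2)

# Nesting: every 4-bit (or 2-bit) reconstruction equals a valid 8-bit code times s8.
assert np.all(np.abs(c4 * 16) <= 127) and np.all(np.abs(c2 * 64) <= 127)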
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1749