Keywords: Large Language Models, Quantization-Aware Training, 2-bit quantization
TL;DR: This paper proposes Residual Refinement Quantization, a plug-and-play method that decomposes 2-bit quantization into two 1-bit subproblems, enabling a flexible quantization lattice, improving gradient stability, and accelerating convergence.
Abstract: The dramatic growth of Large Language Models (LLMs) has been accompanied by significant computational and memory demands, driving the adoption of low-bit quantization. While 8-bit and 4-bit formats have become standard, ultra-low-bit quantization, particularly 2-bit, remains a substantial challenge due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization strategy that decomposes the quantization process into two sequential 1-bit subproblems, enabling an adaptive quantization lattice. Extensive experiments on Llama, OPT, and Qwen were conducted across diverse benchmarks, including question answering, commonsense reasoning, and language modeling. The results demonstrate that R2Q consistently outperforms state-of-the-art 2-bit quantization baselines in both coarse-grained and fine-grained settings. The refinement-based design of R2Q not only enhances quantization performance but also improves training stability and convergence under aggressive compression. Furthermore, R2Q is modular by design and can be seamlessly integrated into existing quantization-aware training (QAT) pipelines.
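The core decomposition described in the abstract, quantizing the weights with one 1-bit pass and then quantizing the residual with a second 1-bit pass, can be sketched as follows. The paper's exact lattice construction and training procedure are not given here; this is a minimal illustrative sketch that assumes per-tensor sign quantization with a mean-absolute-value scale for each 1-bit subproblem (the function names `one_bit_quantize` and `r2q_dequantize` are hypothetical, not from the paper).

```python
import numpy as np

def one_bit_quantize(w: np.ndarray) -> np.ndarray:
    """1-bit quantization: sign(w) scaled by the mean absolute value.

    This is one common choice of 1-bit quantizer; the paper may use a
    different scale. Returns the dequantized (reconstructed) values.
    """
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w)

def r2q_dequantize(w: np.ndarray) -> np.ndarray:
    """Sketch of residual refinement: two sequential 1-bit subproblems.

    The first pass captures the coarse sign structure; the second pass
    quantizes the residual, yielding a 4-level (2-bit) adaptive lattice
    of the form (+/- alpha1) + (+/- alpha2).
    """
    q1 = one_bit_quantize(w)          # first 1-bit subproblem
    q2 = one_bit_quantize(w - q1)     # refine the residual with 1 bit
    return q1 + q2
```

Because the second pass fits a scale to the residual, the combined reconstruction error is never worse than that of the single 1-bit pass, which is one intuition for why the refinement improves stability under aggressive compression.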
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8440