Keywords: Large Language Models, Quantization-Aware Training, 2-bit quantization
TL;DR: This paper proposes Residual Refinement Quantization, a plug-and-play method that decomposes 2-bit quantization into two 1-bit subproblems, enabling a flexible quantization lattice, improving gradient stability, and accelerating convergence.
Abstract: The dramatic growth of Large Language Models (LLMs) has been accompanied by significant computational and memory demands, driving the adoption of low-bit quantization. While 8-bit and 4-bit formats have become standard, ultra-low-bit quantization, particularly 2-bit, remains a substantial challenge due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization strategy that decomposes the quantization process into two sequential 1-bit subproblems, enabling an adaptive quantization lattice. Extensive experiments on Llama, OPT, and Qwen were conducted across diverse benchmarks, including question answering, commonsense reasoning, and language modeling. The results demonstrate that R2Q consistently outperforms state-of-the-art 2-bit quantization baselines in both coarse-grained and fine-grained settings. The refinement-based design of R2Q not only enhances quantization performance but also improves training stability and convergence under aggressive compression. Furthermore, R2Q is modular by design and can be seamlessly integrated into existing quantization-aware training (QAT) pipelines.
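The core decomposition described in the abstract, quantizing the weights with one 1-bit pass and then quantizing the residual with a second 1-bit pass, can be sketched as follows. The paper's exact lattice construction and training procedure are not given here; this is a minimal illustrative sketch that assumes per-tensor sign quantization with a mean-absolute-value scale for each 1-bit subproblem (the function names `one_bit_quantize` and `r2q_dequantize` are hypothetical, not from the paper).

```python
import numpy as np

def one_bit_quantize(w: np.ndarray) -> np.ndarray:
    """1-bit quantization: sign(w) scaled by the mean absolute value.

    This is one common choice of 1-bit quantizer; the paper may use a
    different scale. Returns the dequantized (reconstructed) values.
    """
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w)

def r2q_dequantize(w: np.ndarray) -> np.ndarray:
    """Sketch of residual refinement: two sequential 1-bit subproblems.

    The first pass captures the coarse sign structure; the second pass
    quantizes the residual, yielding a 4-level (2-bit) adaptive lattice
    of the form (+/- alpha1) + (+/- alpha2).
    """
    q1 = one_bit_quantize(w)          # first 1-bit subproblem
    q2 = one_bit_quantize(w - q1)     # refine the residual with 1 bit
    return q1 + q2
```

Because the second pass fits a scale to the residual, the combined reconstruction error is never worse than that of the single 1-bit pass, which is one intuition for why the refinement improves stability under aggressive compression.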
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8440