Towards W2A4 LLM Inference: Hybrid SQ-VQ Framework with Adaptive Error Compensation

ICLR 2026 Conference Submission 811 Authors

02 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Quantization
Abstract: Quantization is a powerful approach for reducing the memory footprint and accelerating the inference of Large Language Models (LLMs). However, it faces a fundamental dilemma: computation-friendly Scalar Quantization (SQ) suffers severe performance degradation at ultra-low bit-widths, whereas memory-friendly Vector Quantization (VQ) maintains higher accuracy but fails to reduce computational demand. As a result, achieving both computational efficiency and high-fidelity compression in ultra-low-bit regimes (e.g., W2A4) remains an open challenge. To address this, we propose $\textbf{AEC-SVQ}$, a hybrid framework that synergistically integrates SQ and VQ for high-performance, ultra-low-bit LLM inference. The framework is built on three innovations. To simultaneously address the disparate distributional challenges posed by weight VQ, activation SQ, and codebook integer quantization, we introduce a $\textbf{learned rotation-smooth transformation}$ that adaptively promotes quantization-friendly distributions for weights, activations, and codebooks within the hybrid SQ–VQ scheme. To mitigate the compounding errors caused by the independent quantization of weights and activations, we propose the $\textbf{Cumulative-Error-Aware Vector Quantization (CEAVQ) algorithm}$. CEAVQ adjusts weights to compensate for the cumulative error from upstream quantized layers, thereby proactively aligning with the full-precision output distribution. To ensure robustness against statistical noise from limited calibration data, we introduce a closed-form, data-driven $\textbf{Adaptive Compensation}$ that modulates the compensation strength for cumulative errors, preventing overfitting to calibration-set statistics and ensuring stable generalization. AEC-SVQ enables a W2A4 pipeline that achieves the memory footprint of a 2-bit model while exploiting the computational efficiency of 4-bit integer arithmetic. On LLaMA-30B, it delivers a 3.6$\times$ speedup and 7.1$\times$ memory saving, establishing a practical frontier for ultra-low-bit LLM deployment.
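
As a rough illustration of the idea behind cumulative-error-aware weight compensation (the abstract does not give the exact CEAVQ formulation), one can view it as a ridge-regularized least-squares correction that adjusts a layer's weights so that its output on upstream-quantized activations tracks the full-precision output. The sketch below is an assumption-laden stand-in, not the paper's algorithm: the function name, the ridge parameter `lam`, and the blending factor `alpha` (a proxy for the adaptive compensation strength) are all illustrative.

```python
# Hedged sketch only: a least-squares view of compensating cumulative upstream
# quantization error before quantizing a layer's own weights. Names and
# parameters (lam, alpha) are assumptions, not the paper's CEAVQ.
import numpy as np

def compensate_weights(W_fp, X_fp, X_q, lam=1e-3, alpha=1.0):
    """Return weights adjusted so that X_q @ W_adj approximates X_fp @ W_fp.

    W_fp : (d_in, d_out) full-precision weights of the current layer.
    X_fp : (n, d_in) calibration activations from the full-precision model.
    X_q  : (n, d_in) the same activations after upstream layers were quantized.
    lam  : ridge term guarding against noisy, limited calibration statistics.
    alpha: compensation strength in [0, 1]; 0 keeps W_fp unchanged, 1 applies
           the full closed-form correction (a proxy for adaptive modulation).
    """
    target = X_fp @ W_fp                                  # full-precision layer output
    gram = X_q.T @ X_q + lam * np.eye(X_q.shape[1])       # regularized Gram matrix
    W_ls = np.linalg.solve(gram, X_q.T @ target)          # closed-form ridge solution
    return (1.0 - alpha) * W_fp + alpha * W_ls            # blend by compensation strength

# Toy usage on synthetic calibration data.
rng = np.random.default_rng(0)
X_fp = rng.normal(size=(256, 64))
X_q = X_fp + 0.05 * rng.normal(size=X_fp.shape)           # simulated upstream error
W_fp = rng.normal(size=(64, 32))
W_adj = compensate_weights(W_fp, X_fp, X_q, lam=1e-3, alpha=0.8)
err_before = np.linalg.norm(X_q @ W_fp - X_fp @ W_fp)
err_after = np.linalg.norm(X_q @ W_adj - X_fp @ W_fp)     # expected to shrink
```

In this toy view, `alpha` plays the role the abstract assigns to Adaptive Compensation: damping the correction when calibration statistics are too noisy to trust, so the adjusted weights do not overfit the calibration set.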
Primary Area: foundation or frontier models, including LLMs
Submission Number: 811