Lifted Uniform Quantization for Extreme Low-bit Large Language Models

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Quantization; Large Language Model
Abstract: Pushing large language models to extreme low bit-widths (e.g., 2-bit) is a critical frontier for efficient deployment, yet it poses a daunting challenge for preserving model accuracy. Current methods are trapped in a fundamental trade-off: Vector Quantization (VQ) maintains accuracy by learning expressive codebooks but is crippled by computationally expensive, non-parallelizable lookup operations, while Uniform Quantization (UQ) is exceptionally efficient but suffers a precipitous drop in quality at such low bit-widths. To break this impasse, we propose Lifted Uniform Quantization (LiftUQ), a new paradigm that encodes weights in an expanded latent space using ultra-low-bit uniform quantization (1-bit in our practice) and then applies a trainable linear dimensionality-reduction transformation to project them back into the original space, forming non-uniform codepoints without any lookup codebook. This lifted-projected representation recovers, and even surpasses, the expressive power of vector quantization while retaining the decoding efficiency of scalar uniform quantization. To make LiftUQ applicable to arbitrary layers, we further learn a whitening transform that produces approximately independent, Gaussian-like channels, and then apply the same lifted-projected encoding. LiftUQ marks a significant breakthrough in extreme low-bit quantization. Our experiments validate that it is the \textbf{first framework to bridge the long-standing accuracy gap between uniform and vector quantization}, consistently matching or surpassing VQ performance on Llama and Qwen models; for instance, it suffers less than 2.7/1.1 points of accuracy degradation on Llama-3-70B at 2-/3-bit. Critically, this high accuracy is achieved with exceptional efficiency: by combining the inherent speed of uniform decoding with a lightweight linear projection, LiftUQ boosts throughput by up to 6.7$\times$ over FP16. This establishes LiftUQ as a new, superior paradigm for practical quantization.
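To make the lifted-projected idea concrete, below is a minimal NumPy sketch under our own assumptions; it is not the paper's actual algorithm. Each d-dimensional weight group is encoded as m = lift_ratio * d sign bits (1-bit uniform quantization in a lifted space), and a single learned matrix P projects the bits back to d dimensions, so decoding is one dense matmul with no codebook lookup. The fitting loop (alternating least squares over P and greedy bit flips over the codes), the group size, and all names such as `fit_liftuq` and `lift_ratio` are illustrative; the paper's trainable projection, its optimization procedure, and the whitening step are not shown.

```python
import numpy as np

def fit_liftuq(W, lift_ratio=2, iters=10, seed=0):
    """W: (n_groups, d) weight groups. Returns sign codes B (n, m) and a shared
    projection P (d, m) with W ~= B @ P.T. Storage is m bits per d weights,
    i.e. ~lift_ratio bits/weight, plus the shared P amortized over all groups."""
    n, d = W.shape
    m = lift_ratio * d                              # lifted latent dimension
    rng = np.random.default_rng(seed)
    B = rng.choice([-1.0, 1.0], size=(n, m))        # 1-bit latent codes
    for _ in range(iters):
        # (1) Refit the down-projection by least squares given the codes.
        P = np.linalg.lstsq(B, W, rcond=None)[0].T  # solves B @ P.T ~= W
        # (2) Greedy bit flips: flip any code bit that lowers its group's error.
        R = W - B @ P.T                             # residuals, (n, d)
        for j in range(m):
            D = 2.0 * B[:, [j]] * P[:, j]           # residual change if bit j flips
            flip = ((R + D) ** 2).sum(1) < (R ** 2).sum(1)
            R[flip] += D[flip]
            B[flip, j] *= -1.0
    return B, P

def decode(B, P):
    # Decoding is a single dense matmul over sign bits -- no table lookup,
    # which is what preserves the throughput advantage of uniform quantization.
    return B @ P.T

rng = np.random.default_rng(1)
W = rng.standard_normal((4096, 8))                  # toy Gaussian-like weights
B, P = fit_liftuq(W, lift_ratio=2)                  # ~2 bits per weight
err = np.linalg.norm(W - decode(B, P)) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```

Note that the columns of P define 2^m implicit non-uniform codepoints P @ b in R^d without ever materializing them, which is the sense in which the lifted representation replaces a VQ codebook; the abstract's whitening transform would be applied to the channels before this encoding so that the Gaussian-like assumption holds for arbitrary layers.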
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1241