BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

03 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: language model, model compression, computational efficiency, binary quantization, sub-1-bit compression
TL;DR: Sub-1-bit large language model quantization
Abstract: Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to $\pm$1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. Our approach incorporates two key innovations: (1) a Flash and Accurate Binary Codebook that identifies recurring binary vector clusters, compressing them into compact indices with tailored distance metrics and sign-based centroid updates; (2) a Learnable Transformation that optimizes invertible scaling and rotation matrices to align binarized weights with full-precision distributions, enabling incoherence processing to enhance layer-wise representation quality. This eliminates the need for sparse masks, enabling efficient inference on standard hardware. Extensive evaluations across LLaMA-1/2/3, Qwen-2.5/3, and FBI-LLM families demonstrate that BTC-LLM establishes a new state-of-the-art for extreme LLM compression at 1.11$\sim$0.7 bits. Notably, our BTC-LLM delivers strong performance under extreme compression settings, with just a 3.1\% accuracy drop on LLaMA-2-13B at 0.8 bits in zero-shot benchmarks while achieving a 1.6$\times$ speedup over FP16. Code is in the Appendix.
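To make the first innovation concrete, below is a minimal sketch (not the authors' code) of the binary-codebook idea described in the abstract: sub-vectors of a {-1, +1} weight matrix are clustered k-means-style under Hamming distance, and centroids are updated by a per-coordinate sign (majority vote). The paper's tailored distance metrics and the learnable scaling/rotation transform are not reproduced; names such as `build_binary_codebook`, `vector_len`, and `num_centroids` are illustrative assumptions.

```python
# Minimal sketch: binary codebook over {-1,+1} weight sub-vectors with
# Hamming-distance assignment and sign-based (majority-vote) centroid updates.
import numpy as np

def build_binary_codebook(w_bin, vector_len=8, num_centroids=16, iters=10, seed=0):
    """Cluster length-`vector_len` binary sub-vectors of w_bin (values in {-1,+1})
    into `num_centroids` binary centroids; returns (codebook, indices).
    Effective storage is ~log2(num_centroids)/vector_len bits per weight."""
    rng = np.random.default_rng(seed)
    vecs = w_bin.reshape(-1, vector_len)                  # (N, v) binary sub-vectors
    codebook = vecs[rng.choice(len(vecs), num_centroids, replace=False)].copy()
    for _ in range(iters):
        # For +/-1 vectors, Hamming distance = (v - dot)/2, so nearest centroid = max dot.
        dots = vecs @ codebook.T                          # (N, K)
        idx = np.argmax(dots, axis=1)                     # nearest centroid per vector
        for k in range(num_centroids):
            members = vecs[idx == k]
            if len(members):
                # Sign-based centroid update: per-coordinate majority vote.
                codebook[k] = np.where(members.mean(axis=0) >= 0, 1, -1)
    return codebook, idx

def decode(codebook, idx, shape):
    """Reconstruct the binary weight matrix from codebook indices."""
    return codebook[idx].reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = np.sign(rng.standard_normal((256, 256))).astype(np.int8)
    w[w == 0] = 1
    cb, idx = build_binary_codebook(w)
    w_hat = decode(cb, idx, w.shape)
    bits_per_weight = np.log2(16) / 8                     # 0.5 bits/weight for this config
    print("reconstruction match rate:", (w_hat == w).mean(),
          "| bits/weight (excl. codebook):", bits_per_weight)
```

Because every sub-vector is replaced by a codebook index, storage drops below 1 bit per weight (log2(K)/v bits here, plus a small codebook), and inference needs only a table lookup rather than a sparse mask.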
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1732