BAQ: Efficient Bit Allocation Quantization for Large Language Models

ICLR 2026 Conference Submission 18920 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: model compression, post-training quantization, weight-only quantization, bit allocation
Abstract: Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods either fix a uniform bitwidth or rely on binary sensitivity groupings (``sensitive'' vs.\ ``non-sensitive'') that treat all weights within a group identically, leaving the actual degree of each weight's sensitivity under-exploited. To address this, we introduce, for the first time in the neural network quantization literature, an explicit loss--bitwidth relation that links layer-output distortion to the assigned precision, together with a sensitivity-guided bit-allocation quantization (BAQ) framework. Under mild assumptions, this model makes the layer-wise loss an explicit function of the quantization bitwidth and yields a convex resource-allocation problem with a \emph{closed-form} solution that adapts precision across weights. The model is theoretically motivated by rate--distortion theory and validated by extensive simulations. Inspecting the solution of the resource-allocation problem yields several insights (such as its equal-loss structure), which we exploit to design the proposed algorithm. The algorithm strikes a favorable trade-off between loss minimization and complexity, allowing BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56$\times$ lower perplexity at the same bitwidth on large language models (e.g., OPT, Llama) ranging from 125M to 30B parameters. Leveraging the analytical results obtained from solving the optimal bit-allocation problem, we also provide a theoretical explanation for the observed gains.
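To make the closed-form and equal-loss claims concrete, here is a minimal sketch of the classical high-resolution bit-allocation problem that this kind of rate--distortion analysis typically builds on; the objective, the sensitivity weights $s_i$, the bitwidths $b_i$, and the budget $B$ are illustrative assumptions, not necessarily the paper's exact formulation.

% Illustrative only: a standard rate--distortion-style bit allocation.
% s_i > 0 is a per-weight (or per-group) sensitivity, b_i its bitwidth,
% and B the total bit budget over n groups; the 2^{-2 b_i} distortion
% factor is the usual high-resolution uniform-quantizer model.
\begin{align}
  \min_{\{b_i\}} \; \sum_{i=1}^{n} s_i \, 2^{-2 b_i}
  \quad \text{s.t.} \quad \sum_{i=1}^{n} b_i = B .
\end{align}
% Lagrangian stationarity forces every term s_i 2^{-2 b_i} to a common
% value (the ``equal-loss'' structure) and gives the closed form
\begin{align}
  b_i^{\star} = \frac{B}{n}
  + \frac{1}{2} \log_2 \frac{s_i}{\bigl(\prod_{j=1}^{n} s_j\bigr)^{1/n}} ,
\end{align}
i.e., each group receives the average budget $B/n$ plus a correction proportional to how far its sensitivity sits above or below the geometric mean, so that all groups contribute the same per-group loss at the optimum.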
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18920