BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen; Jungang Li; Jing Xiong; Wenjie Wang; Qingyao Yang; He Xiao; Zhen Li; Taiqiang Wu; Mengzhao Chen; Zhen Peng; Chaofan Tao; Long Shi; Hongxia Yang; Ngai Wong

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

TL;DR: BPDQ expands the feasible solution set of optimal-PTQ by constructing a variable grid, preserving high fidelity in the low-bit regime.

Abstract: Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at github.com/KingdalfGoodman/BPDQ.
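
To make the "variable grid" idea in the abstract concrete, here is a minimal toy sketch, not the authors' algorithm. It assumes each weight in a group is represented as c0 + c1*b1 + c2*b2 with bits b1, b2 in {0, 1}; the offset c0, all function names, and the sample weights are hypothetical. Under that assumption, a uniform UINT2 grid is the special case c2 = 2*c1, and letting c1 and c2 vary freely enlarges the feasible set. The fit below uses plain alternating least squares on the weights themselves; BPDQ's second-order (Hessian-based) refinement and error compensation are deliberately left out.

```python
from itertools import product


def grid_levels(c0, coeffs):
    """Map every bit pattern (b_1..b_K) to its level c0 + sum_k coeffs[k] * b_k."""
    return {
        bits: c0 + sum(c * b for c, b in zip(coeffs, bits))
        for bits in product((0, 1), repeat=len(coeffs))
    }


def assign_bits(weights, c0, coeffs):
    """For each weight, choose the bit pattern whose grid level is nearest."""
    levels = grid_levels(c0, coeffs)
    return [min(levels, key=lambda bits: (levels[bits] - w) ** 2) for w in weights]


def squared_error(weights, codes, c0, coeffs):
    """Sum of squared differences between weights and their assigned levels."""
    levels = grid_levels(c0, coeffs)
    return sum((w - levels[bits]) ** 2 for w, bits in zip(weights, codes))


def uniform_grid(weights, n_bits=2):
    """Min-max uniform grid written in bit-plane form: coeffs[k] = step * 2**k."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (2 ** n_bits - 1)
    return lo, [step * 2 ** k for k in range(n_bits)]


def fit_variable_grid(weights, n_bits=2, iters=20):
    """Alternate between re-assigning bits and refitting the coefficients.

    Illustrative assumption only: plain squared error on the weights,
    starting from the uniform grid. No Hessian information is used.
    """
    c0, coeffs = uniform_grid(weights, n_bits)
    for _ in range(iters):
        codes = assign_bits(weights, c0, coeffs)
        # Exact 1-D least-squares update of the offset, everything else fixed.
        c0 = sum(
            w - sum(c * b for c, b in zip(coeffs, bits))
            for w, bits in zip(weights, codes)
        ) / len(weights)
        # Exact 1-D least-squares update of each coefficient in turn.
        for k in range(n_bits):
            residuals = [
                w - c0 - sum(c * b for j, (c, b) in enumerate(zip(coeffs, bits)) if j != k)
                for w, bits in zip(weights, codes)
                if bits[k] == 1
            ]
            if residuals:  # skip a bit-plane that is all zeros in this group
                coeffs[k] = sum(residuals) / len(residuals)
    return c0, coeffs, assign_bits(weights, c0, coeffs)


# Hypothetical group: most weights cluster near zero, with two outliers.
weights = [-0.9, -0.1, -0.05, 0.0, 0.02, 0.04, 0.08, 0.1, 0.12, 1.0]

u0, u_coeffs = uniform_grid(weights)
err_uniform = squared_error(weights, assign_bits(weights, u0, u_coeffs), u0, u_coeffs)

v0, v_coeffs, v_codes = fit_variable_grid(weights)
err_variable = squared_error(weights, v_codes, v0, v_coeffs)

print("uniform grid error: ", err_uniform)
print("variable grid error:", err_variable)
```

Because the fit starts from the uniform grid and every step is an exact minimization over one block of variables, the fitted grid's squared error cannot exceed the uniform grid's on the same group, up to floating-point rounding. The sketch says nothing about end-to-end model accuracy.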

Lay Summary: Today's LLMs are remarkably intelligent, but running them takes expensive hardware. To make these models smaller, researchers compress them by storing each internal number with fewer digits, similar to how saving a photo at lower quality shrinks the file. This works fine at moderate compression, but breaks down when each number is squeezed into only 2 or 3 digits, where the model starts producing nonsense. The reason is that existing methods force every number to snap onto a fixed, evenly-spaced grid, which fits the data poorly. Our method instead allows this grid to change shape, enlarging the range of possible compressions, so a far better one becomes reachable even at very low bit-widths. We then iteratively refine this compression using information about how errors propagate through the model, correcting earlier mistakes. With our method, a 72B language model that normally needs server-grade GPUs can now run on a single consumer RTX 3090, while still scoring 83.85% on GSM8K (grade-school math), compared to 90.83% for the uncompressed model.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/KingdalfGoodman/BPDQ

Primary Area: Deep Learning->Large Language Models

Keywords: Post-Training Quantization, Large Language Models, Model Compression, Bit-Plane Decomposition

Originally Submitted PDF: pdf

Submission Number: 6938
