BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Everyone | CC BY 4.0
TL;DR: BPDQ expands the feasible solution set of optimal-PTQ by constructing a variable grid, preserving high fidelity in the low-bit regime.
Abstract: Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at github.com/KingdalfGoodman/BPDQ.
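To make the "variable grid via bit-planes and scalar coefficients" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that are not stated in the abstract: each weight in a group is modeled as c0 + c1*b1 + c2*b2 with binary planes b1, b2; the fit uses plain alternating least squares on the weights themselves; and there is no Hessian weighting or error compensation. The function names (`uniform_levels`, `bitplane_levels`, `fit_variable_grid`) are hypothetical. The point is only that a uniform 2-bit grid is the special case c2 = 2*c1, so freeing the coefficients can only enlarge the set of reachable grids.

```python
import numpy as np

def uniform_levels(w, bits=2):
    """Fixed-shape grid: evenly spaced levels spanning [min(w), max(w)]."""
    return np.linspace(w.min(), w.max(), 2 ** bits)

def nearest_level(w, levels):
    """Index of the closest grid level for each weight."""
    return np.abs(w[:, None] - levels[None, :]).argmin(axis=1)

def bitplane_levels(coeffs):
    """All levels c0 + sum_j c_j * b_j, one per bit pattern, plus the patterns."""
    k = len(coeffs) - 1
    codes = np.arange(2 ** k)
    planes = (codes[:, None] >> np.arange(k)[None, :]) & 1  # shape (2^k, k)
    design = np.hstack([np.ones((2 ** k, 1)), planes])
    return design @ coeffs, planes

def fit_variable_grid(w, bits=2, iters=20):
    """Alternate between assigning bit patterns and refitting coefficients.

    Assumption: unweighted squared error on the weights (no Hessian term).
    """
    # Start from the uniform grid: c0 = min(w), c_j = step * 2^j.
    step = (w.max() - w.min()) / (2 ** bits - 1)
    coeffs = np.concatenate([[w.min()], step * 2.0 ** np.arange(bits)])
    for _ in range(iters):
        levels, planes = bitplane_levels(coeffs)
        idx = nearest_level(w, levels)                 # update bit-planes
        design = np.hstack([np.ones((w.size, 1)), planes[idx]])
        coeffs, *_ = np.linalg.lstsq(design, w, rcond=None)  # update coefficients
    levels, planes = bitplane_levels(coeffs)
    idx = nearest_level(w, levels)
    return coeffs, planes[idx], levels[idx]

# Toy group of 128 weights (synthetic data, for illustration only).
rng = np.random.default_rng(0)
w = rng.standard_normal(128)

lv = uniform_levels(w)
uni_err = np.mean((w - lv[nearest_level(w, lv)]) ** 2)
coeffs, bits_used, w_hat = fit_variable_grid(w)
var_err = np.mean((w - w_hat) ** 2)
print(f"uniform grid MSE:  {uni_err:.4f}")
print(f"variable grid MSE: {var_err:.4f}")
```

Because the sketch starts from the uniform grid and neither alternating step can increase the squared error, the variable-grid error is never worse than the uniform one on the same group. The storage cost is the same two bits per weight, plus one extra scalar coefficient per group in this formulation.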
Lay Summary: Today's LLMs are remarkably intelligent, but running them takes expensive hardware. To make these models smaller, researchers compress them by storing each internal number with fewer digits, similar to how saving a photo at lower quality shrinks the file. This works fine at moderate compression, but breaks down when each number is squeezed into only 2 or 3 digits, where the model starts producing nonsense. The reason is that existing methods force every number to snap onto a fixed, evenly-spaced grid, which fits the data poorly. Our method instead allows this grid to change shape, enlarging the range of possible compressions, so a far better one becomes reachable even at very low bit-widths. We then iteratively refine this compression using information about how errors propagate through the model, correcting earlier mistakes. With our method, a 72B language model that normally needs server-grade GPUs can now run on a single consumer RTX 3090, while still scoring 83.85% on GSM8K (grade-school math), compared to 90.83% for the uncompressed model.
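The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement from the paper: it uses a nominal 72e9 parameter count, the RTX 3090's 24 GB memory capacity, and it ignores per-group coefficients, any layers kept at higher precision, activations, and the KV cache. The helper `weight_gigabytes` is a hypothetical name.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 72e9            # assumption: nominal parameter count for a "72B" model
RTX_3090_VRAM_GB = 24    # memory capacity of a single RTX 3090

for bits in (16, 4, 3, 2):
    gb = weight_gigabytes(PARAMS, bits)
    verdict = "fits" if gb < RTX_3090_VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB ({verdict} in {RTX_3090_VRAM_GB} GB)")
```

Under these assumptions only the 2-bit setting leaves headroom on a 24 GB card, which is consistent with the abstract's focus on the 2-bit regime for this model size.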
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/KingdalfGoodman/BPDQ
Primary Area: Deep Learning->Large Language Models
Keywords: Post-Training Quantization, Large Language Models, Model Compression, Bit-Plane Decomposition
Originally Submitted PDF: pdf
Submission Number: 6938