ZipperQuant: Bit-Based Inlier–Outlier Disaggregation for 4-Bit LLMs on GPU–CPU

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Quantization, LLM, Inference Acceleration
TL;DR: 4-Bit Post-Training Quantization for Hybrid GPU–CPU LLM Inference
Abstract: Quantization has become a key technique for reducing the memory footprint and accelerating the inference of large language models (LLMs), especially as modern GPUs provide native INT4 compute units. However, quantizing both weights and activations to low bit-width often causes substantial accuracy degradation, since the limited representational range cannot simultaneously capture common values (inliers) and rare large-magnitude values (outliers). To address this challenge, we propose ZipperQuant, a novel 4-bit quantization paradigm for LLMs that disaggregates the computation of inliers and outliers across GPU and CPU. Naive value-based disaggregation, which offloads entire outlier values to the CPU, suffers from the large performance gap between GPUs and CPUs and the high overhead of inter-device data transfer. ZipperQuant instead introduces a bit-based disaggregation strategy. Using smoothing, activation outliers are first absorbed into the weights, which are then split into low-order and high-order components. The GPU executes all inliers together with the low-order bits of outliers in low precision, while only the sparse high-order bits are offloaded to the CPU and multiplied with activations at high precision. These high-order CPU computations are further accelerated by a specialized lookup-table mechanism: since only a small set of bit patterns occurs, their results can be precomputed and reused, replacing costly multiplications with lightweight table lookups and accumulation, while also eliminating dequantization overhead. Extensive experiments demonstrate that ZipperQuant preserves near-FP16 accuracy while achieving up to $3.01\times$ speedup over a W4A16 baseline on an RTX 4090 with INT4 precision.
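The bit-based split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not state the bit-widths or the exact decomposition, so the sketch assumes 8-bit integer weights, a signed 4-bit low-order part, and hypothetical function names (`split_bits`, `hybrid_matvec`). Both parts run on one device here; the dense term merely stands in for the GPU path and the sparse lookup-table term for the CPU path.

```python
import numpy as np

def split_bits(w_int):
    """Split integer weights so that w_int == low + 16 * high.

    Assumption: low is a signed 4-bit value in [-8, 7]. Weights that already
    fit in 4 bits (inliers) get high == 0, so high is sparse.
    """
    w = w_int.astype(np.int32)
    low = ((w + 8) & 15) - 8   # signed low nibble
    high = (w - low) >> 4      # exact: w - low is a multiple of 16
    return low, high

def hybrid_matvec(x, w_int):
    """Compute x @ w_int as a dense low-order term plus a sparse high-order term."""
    low, high = split_bits(w_int)

    # Stand-in for the GPU path: dense product with the 4-bit part.
    dense = x @ low

    # Stand-in for the CPU path: only nonzero high-order entries are visited.
    rows, cols = np.nonzero(high)
    # Few distinct high-order values occur, so precompute x[i] * v once per
    # value v and replace multiplications with table lookups.
    table = {int(v): x * float(v) for v in np.unique(high[rows, cols])}
    sparse = np.zeros(w_int.shape[1])
    for i, j in zip(rows, cols):
        sparse[j] += table[int(high[i, j])][i]

    # Recombine: the high-order part carries a weight of 2**4.
    return dense + 16.0 * sparse
```

With mostly small weights and a handful of large ones, `high` is nonzero only at the large entries, and `hybrid_matvec(x, w)` agrees with `x @ w` up to floating-point rounding. Choosing a signed low nibble keeps inliers entirely in the dense term, which is what makes the high-order term sparse in this sketch.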
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11478