Keywords: Large Language Models, Post-Training Quantization, SVD, Rotation, Massive Activation
TL;DR: FRTQ is a calibration-only W4A4 post-training quantizer that rotates activations and applies low-rank corrections to weights to minimize the grid-to-stdev ratio, quells massive-activation outliers, and matches higher-bit LLM accuracy without fine-tuning.
Abstract: Large language model inference is constrained by memory and latency. Uniform low‑bit quantization would help, but recent evidence shows that the dominant obstacle is massive activations: rare, extremely large, and largely input‑invariant per‑token scalars, rather than generic channel‑wise outliers. Methods that "smooth" activation outliers by migrating scale into the weights are therefore less effective under this phenomenon. We address it by explicitly rotating activations and preconditioning weights so that both become easy to quantize.
We first identify that the \textbf{grid-to-standard-deviation ratio (GSR)},
$\rho^X_\text{g} = \frac{\Delta_\text{g}}{\operatorname{std}(X_{\text{c}})},$
is a useful proxy for quantization sensitivity, as it measures the coarseness of the quantization steps relative to the intrinsic variability of the activations. Building on this insight, we introduce \textbf{Flattened Rotation TSVD Quantization (FRTQ)}, a post-training quantization framework tailored to ultra-low-bit settings (e.g., W4A4). For activations (per-token), FRTQ learns orthogonal rotations at function-invariant points to contract the GSR and stabilize quantization. For weights (per-channel), FRTQ fits a rank-$r$ truncated-SVD component to capture the dominant directions, quantizes the residual, and realizes the correction via a fused low-rank path. All rotations are folded into adjacent weights, with only a single lightweight on-the-fly rotation required at the FFN down-projection.
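To make the GSR concrete, the following sketch computes $\rho^X_\text{g}$ for a synthetic activation vector under a symmetric uniform 4-bit grid. The specific grid convention (step spanning $[-\max|x|, \max|x|]$) is an assumption for illustration, not necessarily the paper's exact quantizer; the point it shows is that a single massive activation inflates the grid step and hence the GSR.

```python
import numpy as np

def gsr(x, bits=4):
    # Grid-to-standard-deviation ratio: rho = Delta_g / std(x).
    # Assumes a symmetric uniform grid covering [-max|x|, max|x|];
    # FRTQ's exact grid definition may differ.
    delta_g = 2.0 * np.abs(x).max() / (2**bits - 1)
    return delta_g / x.std()

rng = np.random.default_rng(0)
act = rng.normal(size=4096)      # well-behaved activations
act_outlier = act.copy()
act_outlier[0] = 50.0            # one "massive activation"

# The outlier stretches the grid step far beyond the bulk's spread,
# so the GSR grows sharply.
print(gsr(act) < gsr(act_outlier))  # prints True
```

This is why smoothing-based scale migration struggles: the outlier dominates the dynamic range per token, whereas a rotation can spread its energy across coordinates and shrink $\Delta_\text{g}$ relative to $\operatorname{std}(X_\text{c})$.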
By explicitly minimizing the GSR, FRTQ aligns its updates with quantization-error reduction. The method is purely post-training, requires only a small calibration set, and avoids gradient-based fine-tuning. Its alternating updates are simple, scalable, and kernel-friendly. Experiments across standard LLM backbones show that FRTQ consistently reduces the GSR and improves W4A4 accuracy over smoothing-only and rotation-only baselines. On LLaMA-2 70B, FRTQ lowers the activation GSR $\rho$ by 28.69\% compared to DFRot and improves W4A4KV4 zero-shot accuracy by 1.25\%, matching higher-bit baselines with negligible runtime overhead.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 23882