Keywords: LLM inference, Quantization, Attention
Abstract: Quantization has been widely adopted in LLM training and inference pipelines, delivering substantial gains in both cost and efficiency.
However, existing low-bit quantization of the \emph{attention} module often introduces large quantization errors at very low bit-widths, leading to noticeable performance degradation.
Prior work has mostly focused on smoothing techniques to mitigate outliers, whereas we emphasize a \emph{mixed-precision} design.
Motivated by extensive heatmap analyses, we observe that LLM attention patterns typically contain a \emph{very small set of dominant vertical lines} that carry a disproportionate amount of attention mass.
Accordingly, we preserve this small but crucial subset in full precision during quantized computation. Specifically, we propose \textbf{VQuant}.
In the \textbf{Prefill} stage, we design a new quantized attention operator: during computation, we keep the identified vertical-line positions and a local sliding window in full precision, while quantizing the remaining parts to low bit-width.
In the \textbf{Decode} stage, we follow the same principle: when quantizing the KV cache, we keep the vertical lines and the local window unquantized, and further fuse KV dequantization with attention computation to improve hardware efficiency.
Empirically, in the \textbf{Prefill} stage, VQuant reduces quantization MSE by about \textbf{5$\times$} with little extra computational overhead. In the \textbf{Decode} stage, VQuant combines KV-cache quantization with fused attention, preserving end-to-end quality across benchmarks while achieving up to \textbf{3.58$\times$} speedup.
With the two-stage co-design, VQuant achieves near lossless quality in end-to-end evaluations.
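The mixed-precision idea above can be sketched as follows. This is an illustrative simulation, not the paper's kernel: the function names (`quantize_sym`, `mixed_precision_kv`), the symmetric per-row quantizer, and the use of column attention mass to pick the "vertical line" positions are assumptions for the sketch; the paper's actual operator fuses dequantization with attention on hardware.

```python
import numpy as np

def quantize_sym(x, bits=2):
    """Simulated symmetric per-row quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax + 1e-8
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_kv(K, attn_col_mass, top_k=4, window=8, bits=2):
    """Quantize the K cache to low bit-width, but keep the top_k
    dominant 'vertical line' key positions (ranked by accumulated
    attention mass) and a recent local window in full precision."""
    T = K.shape[0]
    keep = set(np.argsort(attn_col_mass)[-top_k:].tolist())
    keep.update(range(max(0, T - window), T))   # local sliding window
    out = quantize_sym(K, bits)
    idx = sorted(keep)
    out[idx] = K[idx]                           # restore full precision
    return out

# Toy usage: one dominant column, short local window.
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
mass = rng.random(16)
mass[3] = 10.0                                  # dominant vertical line
K_mixed = mixed_precision_kv(K, mass, top_k=1, window=4)
```

Because the restored positions contribute zero quantization error, the mixed-precision cache can only lower the overall MSE relative to quantizing every position, which is the effect the abstract quantifies.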
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 9716