Keywords: LLM inference, Quantization, Attention
Abstract: Quantization has been widely adopted in LLM training and inference pipelines, delivering substantial gains in both cost and efficiency.
However, existing low-bit quantization of the \emph{attention} module often introduces large quantization errors at very low bit-widths, leading to noticeable performance degradation.
Prior work has mostly focused on smoothing techniques to mitigate outliers, whereas we emphasize a \emph{mixed-precision} design.
Motivated by extensive heatmap analyses, we observe that LLM attention patterns typically contain a \emph{very small set of dominant vertical lines} that carry a disproportionate amount of attention mass.
Accordingly, we preserve this small but crucial subset in full precision during quantized computation. Specifically, we propose \textbf{VQuant}.
In the \textbf{Prefill} stage, we design a new quantized attention operator: during computation, we keep the identified vertical-line positions and a local sliding window in full precision, while quantizing the remaining parts to low bit-width.
In the \textbf{Decode} stage, we follow the same principle: when quantizing the KV cache, we keep the vertical lines and the local window unquantized, and further fuse KV dequantization with attention computation to improve hardware efficiency.
Empirically, in the \textbf{Prefill} stage, VQuant reduces quantization MSE by about \textbf{5$\times$} with little extra computational overhead. In the \textbf{Decode} stage, VQuant combines KV-cache quantization with fused attention, preserving end-to-end quality across benchmarks while achieving up to \textbf{3.58$\times$} speedup.
With the two-stage co-design, VQuant achieves near lossless quality in end-to-end evaluations.
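The mixed-precision idea above can be sketched as follows. This is an illustrative simulation, not the paper's kernel: the function names (`quantize_sym`, `mixed_precision_kv`), the symmetric per-row quantizer, and the use of column attention mass to pick the "vertical line" positions are assumptions for the sketch; the paper's actual operator fuses dequantization with attention on hardware.

```python
import numpy as np

def quantize_sym(x, bits=2):
    """Simulated symmetric per-row quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax + 1e-8
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_kv(K, attn_col_mass, top_k=4, window=8, bits=2):
    """Quantize the K cache to low bit-width, but keep the top_k
    dominant 'vertical line' key positions (ranked by accumulated
    attention mass) and a recent local window in full precision."""
    T = K.shape[0]
    keep = set(np.argsort(attn_col_mass)[-top_k:].tolist())
    keep.update(range(max(0, T - window), T))   # local sliding window
    out = quantize_sym(K, bits)
    idx = sorted(keep)
    out[idx] = K[idx]                           # restore full precision
    return out

# Toy usage: one dominant column, short local window.
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
mass = rng.random(16)
mass[3] = 10.0                                  # dominant vertical line
K_mixed = mixed_precision_kv(K, mass, top_k=1, window=4)
```

Because the restored positions contribute zero quantization error, the mixed-precision cache can only lower the overall MSE relative to quantizing every position, which is the effect the abstract quantifies.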
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 9716