Abstract: Effective compression of key-value ($KV$) caches is essential for efficient inference of Large Language Models (LLMs). Three main types of $KV$ cache compression techniques have been identified: sparsity, channel compression, and quantization. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the $K$ cache. The $K$ cache is first transformed into "latent channels" using an SVD basis representation. Since the values in the latent channels decay rapidly and become negligible after only the first few, our method then applies importance-aware quantization and compression to the latent channels, allocating higher precision to the more significant ones. Theoretically, we prove that SVDq yields quantization errors far lower than those of per-channel key quantization in the original space ($0.1\times$ or even less). Our findings demonstrate that SVDq can achieve an equivalent key cache precision as low as $\textbf{1.25}$ bits. When combined with key sparsity, it can reach a key compression ratio of up to $\textbf{410}\times$ for attention computation, all while maintaining comparable model performance. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for $KV$ cache compression in LLMs.
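The sketch below illustrates the mixed-precision idea described in the abstract, assuming a NumPy setting: keys are projected onto an SVD basis, the leading latent channels are quantized with more bits than the rapidly decaying trailing ones, and the result is mapped back to the original space for attention. The function name, bit allocation, group sizes, and simple symmetric quantizer are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of an SVD-based mixed-precision key-cache quantizer.
# Bit budgets and group sizes are hypothetical; the paper's method may
# differ (e.g., it may derive the SVD basis from calibration data).
import numpy as np

def svdq_compress(K, bit_budget=(8, 4, 2), group_sizes=(8, 24, 96)):
    """Quantize a key cache K (tokens x channels) in an SVD latent basis.

    Leading latent channels (largest singular values) receive more bits;
    trailing channels, whose magnitudes decay rapidly, receive fewer.
    Latent channels beyond the budgeted groups are dropped entirely.
    """
    # SVD basis of the key matrix: K = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    # Latent-channel representation: project keys onto the SVD basis.
    Z = K @ Vt.T                          # tokens x latent channels

    Z_hat = np.zeros_like(Z)
    start = 0
    for bits, size in zip(bit_budget, group_sizes):
        end = min(start + size, Z.shape[1])
        block = Z[:, start:end]
        # Symmetric per-latent-channel quantization at the chosen bit width.
        scale = np.abs(block).max(axis=0, keepdims=True) / (2 ** (bits - 1) - 1)
        scale[scale == 0] = 1.0
        Z_hat[:, start:end] = np.round(block / scale) * scale
        start = end

    # Map quantized latent channels back to the original key space.
    return Z_hat @ Vt
```

Because the higher-bit groups cover only the first few latent channels, the average bit width stays low while the dominant components retain high precision, which is the intuition behind the low equivalent bit widths reported in the abstract.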
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 3061