Abstract: Effective compression of key-value ($KV$) caches is essential for efficient inference of Large Language Models (LLMs). Three main types of $KV$ cache compression techniques have been identified: sparsity, channel compression, and quantization. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the $K$ cache. The $K$ cache is first transformed into "latent channels" using an SVD basis representation. Since the values in the latent channels decay rapidly and become negligible after only the first few, our method then applies importance-aware quantization and compression to the latent channels, allocating higher precision to the more significant ones. Theoretically, we prove that SVDq yields quantization errors far lower than those of per-channel key quantization in the original space ($0.1\times$ or even less). Our findings demonstrate that SVDq can achieve an equivalent key cache precision as low as $\textbf{1.25}$ bits. When combined with key sparsity, it can reach a key compression ratio of up to $\textbf{410}\times$ for attention computation, all while maintaining comparable model performance. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for $KV$ cache compression in LLMs.
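The sketch below illustrates the mixed-precision idea described in the abstract, assuming a NumPy setting: keys are projected onto an SVD basis, the leading latent channels are quantized with more bits than the rapidly decaying trailing ones, and the result is mapped back to the original space for attention. The function name, bit allocation, group sizes, and simple symmetric quantizer are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of an SVD-based mixed-precision key-cache quantizer.
# Bit budgets and group sizes are hypothetical; the paper's method may
# differ (e.g., it may derive the SVD basis from calibration data).
import numpy as np

def svdq_compress(K, bit_budget=(8, 4, 2), group_sizes=(8, 24, 96)):
    """Quantize a key cache K (tokens x channels) in an SVD latent basis.

    Leading latent channels (largest singular values) receive more bits;
    trailing channels, whose magnitudes decay rapidly, receive fewer.
    Latent channels beyond the budgeted groups are dropped entirely.
    """
    # SVD basis of the key matrix: K = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    # Latent-channel representation: project keys onto the SVD basis.
    Z = K @ Vt.T                          # tokens x latent channels

    Z_hat = np.zeros_like(Z)
    start = 0
    for bits, size in zip(bit_budget, group_sizes):
        end = min(start + size, Z.shape[1])
        block = Z[:, start:end]
        # Symmetric per-latent-channel quantization at the chosen bit width.
        scale = np.abs(block).max(axis=0, keepdims=True) / (2 ** (bits - 1) - 1)
        scale[scale == 0] = 1.0
        Z_hat[:, start:end] = np.round(block / scale) * scale
        start = end

    # Map quantized latent channels back to the original key space.
    return Z_hat @ Vt
```

Because the higher-bit groups cover only the first few latent channels, the average bit width stays low while the dominant components retain high precision, which is the intuition behind the low equivalent bit widths reported in the abstract.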
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 3061