Keywords: Inference Optimization, KV Cache Compression, Low-rank Projection
Abstract: KV cache has become a *de facto* technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention.
However, as model and data sizes grow, the KV cache can quickly become a system bottleneck in both storage and memory transfer.
To address this, prior studies usually focus on the first three axes of the cache tensors for compression.
This paper complements them by focusing on the feature dimension axis,
utilizing low-rank projection matrices to transform the cache features into spaces of reduced dimension.
We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA).
We identify a drawback of PCA projection: model performance degrades rapidly under relatively low compression rates (less than 60%).
We explain this phenomenon with insights derived from the principles of the attention mechanism.
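As an illustration of this baseline, the sketch below applies PCA-based low-rank projection to one attention head's cached features along the feature dimension; the class name, shapes, and offline calibration step are our assumptions, not the paper's implementation.

```python
# A minimal sketch (an assumption, not the paper's implementation) of PCA-based
# low-rank projection of one attention head's cached key/value features along
# the feature-dimension axis.
import numpy as np

class PCACacheCompressor:
    def __init__(self, calib_features: np.ndarray, rank: int):
        # calib_features: (num_tokens, d) key or value vectors collected offline.
        self.mean = calib_features.mean(axis=0, keepdims=True)
        # Right singular vectors of the centered features are the principal directions.
        _, _, vt = np.linalg.svd(calib_features - self.mean, full_matrices=False)
        self.proj = vt[:rank].T                      # (d, rank), orthonormal columns

    def compress(self, kv: np.ndarray) -> np.ndarray:
        # Store only the (n, rank) low-dimensional cache instead of (n, d).
        return (kv - self.mean) @ self.proj

    def decompress(self, kv_low: np.ndarray) -> np.ndarray:
        # Approximately reconstruct the (n, d) features before attention.
        return kv_low @ self.proj.T + self.mean

# Example: keep 51 of 128 feature dimensions (roughly a 60% compression rate along this axis).
keys = np.random.randn(1024, 128).astype(np.float32)
compressor = PCACacheCompressor(keys, rank=51)
err = np.linalg.norm(keys - compressor.decompress(compressor.compress(keys)))
print(err / np.linalg.norm(keys))
```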
To bridge the gap, we propose to directly tune the orthogonal projection matrix on the continual pre-training or supervised fine-tuning datasets with an elaborate Matryoshka learning strategy.
Thanks to such a strategy, we can adaptively search for the optimal compression rates for various layers and heads given varying compression budgets.
Compared to Multi-head Latent Attention (MLA), our method can be readily applied to pre-trained LLMs and offers a smooth trade-off between performance and compression rate.
We observe high data efficiency in our training procedure and find that our method sustains over 90% of the original performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs such as LLaMA2 and Mistral.
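As a rough illustration of the Matryoshka idea, the sketch below trains a single orthogonal projection whose leading column prefixes act as nested lower-rank projections; it uses a plain reconstruction loss on stand-in features and PyTorch's orthogonal parametrization, which are assumptions rather than the paper's actual training objective or code.

```python
# A rough sketch (an assumption about the general recipe, not the paper's code) of
# Matryoshka-style training: one orthogonal matrix is shared across nested ranks,
# and the loss averages over those ranks so that any leading prefix of columns
# remains a usable low-rank projection at inference time.
import torch
import torch.nn as nn

d_head, ranks = 128, (32, 64, 96, 128)                 # nested truncation points

proj = nn.Linear(d_head, d_head, bias=False)
proj = nn.utils.parametrizations.orthogonal(proj)      # constrain the weight to be orthogonal
opt = torch.optim.AdamW(proj.parameters(), lr=1e-3)

def matryoshka_loss(kv: torch.Tensor) -> torch.Tensor:
    """Average a simple reconstruction loss over all nested ranks for (n, d) features."""
    W = proj.weight                                    # (d, d), orthogonal
    losses = []
    for r in ranks:
        Wr = W[:, :r]                                  # leading r columns = rank-r projection
        recon = kv @ Wr @ Wr.T                         # project, then reconstruct
        losses.append(torch.mean((kv - recon) ** 2))
    return torch.stack(losses).mean()

# One illustrative optimization step on random stand-in features; the paper instead
# tunes the projections on continual pre-training or supervised fine-tuning data.
kv = torch.randn(1024, d_head)
opt.zero_grad()
loss = matryoshka_loss(kv)
loss.backward()
opt.step()
```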
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8952