Towards Dynamic KV-Cache Compression: Fine-Grained Evaluation of Key and Value Ranks in LLMs

Published: 24 Sept 2025, Last Modified: 03 Oct 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: KV-cache, Large Language Models, Model Compression, Low-Rank Analysis
TL;DR: We propose an incremental algorithm for performing SVD on the KV-cache of large language models, yielding optimal low-rank decompositions, and use the resulting singular values to assess the data-dependent compressibility of keys and values.
Abstract: Large language models rely on the KV-cache to avoid redundant computation during autoregressive decoding, but reading and writing the growing cache quickly overwhelms GPU memory bandwidth as context length increases. Recent studies therefore explore KV-cache compression; however, existing work either overlooks the data-dependent nature of key/value features or ignores their layer-level differences. In this work, we propose a method that directly computes the optimal data-dependent compression of key and value activations via singular value decomposition during inference. Our approach is gradient-free and incremental, enabling independent per-layer decomposition with batched computation and low memory cost. Using this method, we conduct a comprehensive analysis across multiple models and datasets spanning diverse domains and languages, uncovering fine-grained patterns of KV-cache compressibility. Our method serves as a valuable evaluation tool that reveals how LLMs allocate their representational capacity, offering actionable insights for designing dynamic and data-aware KV-cache compression strategies for deployment.
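The mechanism the abstract describes, an incremental, per-layer SVD of accumulated key (or value) activations whose singular values quantify data-dependent compressibility, can be illustrated with a minimal sketch. This is not the paper's implementation: the function `incremental_kv_singular_values`, its arguments, and the toy data below are assumptions made for illustration, and only the singular values and right singular vectors are tracked so the state stays O(hidden_dim^2) regardless of sequence length.

```python
import numpy as np

def incremental_kv_singular_values(kv_blocks, hidden_dim):
    """Track the singular values of the accumulated key (or value) matrix
    incrementally, without ever materialising the full (seq_len x hidden_dim)
    cache. `kv_blocks` is an iterable of (block_len, hidden_dim) arrays, e.g.
    the key activations produced at each decoding step for one layer.
    (Hypothetical helper for illustration, not the paper's code.)"""
    S = np.zeros(0)                   # running singular values
    Vh = np.zeros((0, hidden_dim))    # running right singular vectors

    for block in kv_blocks:
        # Stacking the running summary S * Vh with the new rows preserves the
        # Gram matrix of the full accumulated cache, so the SVD of this small
        # matrix yields the exact singular values of the full matrix.
        stacked = np.vstack([S[:, None] * Vh, block])
        _, S, Vh = np.linalg.svd(stacked, full_matrices=False)
        # Rank never exceeds hidden_dim, so truncation here loses nothing.
        S, Vh = S[:hidden_dim], Vh[:hidden_dim]

    return S

# Toy usage: synthetic "key" blocks that share a rank-8 subspace of a
# 64-dimensional feature space; the spectrum collapses to 8 directions.
rng = np.random.default_rng(0)
basis = rng.standard_normal((8, 64))
blocks = [rng.standard_normal((16, 8)) @ basis for _ in range(8)]

S = incremental_kv_singular_values(blocks, hidden_dim=64)
energy = np.cumsum(S**2) / np.sum(S**2)
print("numerical rank of the accumulated cache:", int(np.sum(S > 1e-8 * S[0])))
print("directions needed for 99% of spectral energy:",
      int(np.searchsorted(energy, 0.99)) + 1)
```

Because each update only decomposes a (rank + block_len) x hidden_dim matrix, the per-step cost is independent of how long the context has grown, which is what makes a gradient-free, per-layer analysis of compressibility feasible during inference; the energy-fraction readout at the end is one simple way to turn the singular values into a compressibility estimate.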
Submission Number: 49