Keywords: compression, KV cache, generation, efficiency
TL;DR: Learning dynamic latent dimensions for KV cache compression
Abstract: Efficient inference in Large Language Models (LLMs) commonly relies on a key-value (KV) cache. However, while the cache removes the quadratic compute bottleneck of vanilla attention, it trades it for a memory footprint that grows linearly with sequence length and is often prohibitive. Modern approaches to reducing KV cache memory rely either on token eviction or on deterministic dimensionality-reduction methods that apply a uniform budget to every layer. To address this inflexibility, we introduce a \emph{learnable adaptive compression} method that dynamically retrofits the existing KV cache of each layer with a trainable compression budget and encoding and decoding components. Experiments on $\text{LLaMA-3.1-8B}$ across various benchmarks show that our method maintains the original model's performance within $1\%$ at $\times2$ and $\times4$ KV cache compression and within $2\%-4\%$ at $\times4$ and $\times8$ reduction. Our experiments also show that this trainable adaptive budgeting lets the model devote more capacity to late layers, where semantic abstractions are denser. This offers layer-wise interpretability of attention sparsity and opens the door to principled analysis and hardware-aware scheduling during inference.
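To make the idea of a per-layer trainable compression budget with encoder/decoder components concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the class name `KVCompressor`, the sigmoid-gated latent channels, and the `max_latent_dim` cap are illustrative assumptions about how a learnable budget could be attached to one layer's KV states.

```python
# Minimal sketch (assumed design, not the paper's code): a per-layer KV cache
# compressor with a learnable latent width. A linear encoder projects each
# head's K/V vectors into a capped latent space, a learnable gate soft-masks
# latent channels (so the effective budget can shrink during training), and
# a linear decoder reconstructs the original head dimension for attention.
import torch
import torch.nn as nn


class KVCompressor(nn.Module):
    def __init__(self, head_dim: int, max_latent_dim: int):
        super().__init__()
        self.encoder = nn.Linear(head_dim, max_latent_dim, bias=False)
        self.decoder = nn.Linear(max_latent_dim, head_dim, bias=False)
        # One logit per latent channel; sigmoid(gate) acts as a soft keep-probability.
        self.gate = nn.Parameter(torch.zeros(max_latent_dim))

    def budget(self) -> torch.Tensor:
        # Expected number of active latent channels (differentiable), which a
        # training loss could penalize to trade accuracy against compression.
        return torch.sigmoid(self.gate).sum()

    def compress(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, heads, seq, head_dim) -> (batch, heads, seq, max_latent_dim)
        return self.encoder(kv) * torch.sigmoid(self.gate)

    def decompress(self, latent: torch.Tensor) -> torch.Tensor:
        # Reconstruct head_dim vectors for use in the attention computation.
        return self.decoder(latent)


if __name__ == "__main__":
    comp = KVCompressor(head_dim=128, max_latent_dim=64)
    keys = torch.randn(2, 8, 16, 128)   # (batch, heads, seq, head_dim)
    latent = comp.compress(keys)        # stored in the cache instead of raw keys
    recon = comp.decompress(latent)     # recovered at attention time
    print(latent.shape, recon.shape, float(comp.budget()))
```

Under this reading, instantiating one such module per layer and regularizing the per-layer `budget()` terms would let different layers settle on different effective latent widths, matching the abstract's observation that later layers tend to receive more capacity.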
Submission Number: 300