TL;DR: An effective KV compression technique for long-response benchmarks
Abstract: Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias.
We propose $MorphKV$, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, which is crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9\% memory savings and 18.2\% higher accuracy on average compared to state-of-the-art prior works, enabling efficient deployment.
Lay Summary: The increasing size of KV caches presents a critical bottleneck, particularly for long-response tasks such as content creation and code generation. Compressed KV caches address this challenge by evicting unimportant tokens, but they present a trade-off between accuracy and memory savings: retaining too few KVs reduces accuracy, whereas accurate methods are not memory-efficient. Ideally, we want to compress KV caches without sacrificing accuracy.
We introduce $MorphKV$, a dynamic KV cache pruning method that maintains a constant-size KV cache by keeping only a select subset of KVs. MorphKV improves accuracy by dynamically preserving only those KVs that exhibit strong correlation with recently generated tokens. This enables MorphKV to efficiently handle long-context and long-response tasks, even when operating with limited hardware resources.
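For intuition, below is a minimal sketch of the correlation-aware eviction idea described above: score older cached tokens by how strongly the most recent tokens attend to them, and keep only the top-scoring ones plus a recent window. All names, shapes, and the scoring rule here are illustrative assumptions, not the authors' implementation; see the repository linked below for the actual code.

```python
import torch

def correlation_aware_evict(keys, values, recent_attn, window=8, budget=128):
    """Hedged sketch of correlation-aware KV eviction (hypothetical helper).

    keys, values: [seq_len, head_dim] cached KV states for one attention head
    recent_attn:  [window, seq_len] attention weights of the most recent
                  `window` tokens over all cached positions
    Returns pruned (keys, values) with at most `budget` entries.
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values  # cache already within budget

    # Always keep the most recent `window` tokens for local coherence.
    old_len = seq_len - window

    # Score each older token by how strongly the recent tokens attend to it.
    scores = recent_attn[:, :old_len].sum(dim=0)  # [old_len]

    # Retain the highest-correlation older tokens within the budget.
    keep_old = torch.topk(scores, budget - window).indices.sort().values

    # Concatenate retained older positions with the recent window.
    keep = torch.cat([keep_old, torch.arange(old_len, seq_len)])
    return keys[keep], values[keep]
```

Calling such a routine after each generation step would keep the cache at a constant size, since only `budget` KV pairs survive each update.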
Link To Code: https://github.com/ghadiaravi13/MorphKV
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Key-Value Cache Compression
Submission Number: 13734