Keywords: KV cache, Efficient LLM, low rank compression
TL;DR: We propose StreamingKV, a dynamic key-value cache compression method for large language models that adapts low-rank projection bases in real time, inspired by the Generalized Hebbian Algorithm.
Abstract: Modern large language models (LLMs) face severe memory bottlenecks during inference due to the ever-growing key-value (KV) cache, especially in long-context settings. While recent low-rank compression techniques mitigate this issue, their reliance on static projection bases leads to suboptimal generalization across diverse prompts—ultimately compromising model performance. We introduce StreamingKV, an adaptive compression framework that dynamically updates the low-rank projection bases during inference, inspired by the Generalized Hebbian Algorithm (GHA). Unlike static methods, StreamingKV tailors the projection subspace to each input prompt in real time, significantly enhancing representation quality with minimal computational overhead. Extensive experiments across multiple model families on long-context tasks demonstrate that StreamingKV consistently improves accuracy at the same compression ratio with negligible latency increase.
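To make the GHA-based adaptation concrete, here is a minimal NumPy sketch of the classical Generalized Hebbian Algorithm (Sanger's rule) tracking the top principal directions of a stream of vectors. This is an illustration of the underlying update rule only, not the paper's actual implementation; the function name `gha_update`, the toy dimensions, and the learning rate are all hypothetical.

```python
import numpy as np

def gha_update(W, x, eta=0.01):
    """One Generalized Hebbian Algorithm (Sanger's rule) step.

    W: (r, d) current low-rank projection basis (rows approach principal directions).
    x: (d,) incoming vector (e.g., a newly generated key vector).
    """
    y = W @ x  # project the new vector onto the current basis
    # Sanger's rule: Hebbian term minus a lower-triangular deflation term,
    # which keeps the rows orthonormal and ordered by variance explained.
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Toy demo: recover the top principal directions of streaming data.
rng = np.random.default_rng(0)
d, r = 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # known eigenbasis
lam = np.array([5.0, 3.0] + [0.1] * (d - 2))      # eigenvalues with a clear gap
W = 0.1 * rng.standard_normal((r, d))
for _ in range(5000):
    x = Q @ (np.sqrt(lam) * rng.standard_normal(d))
    W = gha_update(W, x)

# Alignment of the first learned row with the true top direction (close to 1).
align = abs(W[0] @ Q[:, 0]) / np.linalg.norm(W[0])
```

In a KV-cache setting, each incoming key vector would play the role of `x`, so the projection subspace drifts toward the dominant directions of the current prompt as decoding proceeds.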
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11368