Keywords: KV cache, Efficient LLM, low rank compression
TL;DR: We propose StreamingKV, a dynamic key-value cache compression method for large language models that adapts low-rank projection bases in real time, inspired by the Generalized Hebbian Algorithm.
Abstract: Modern large language models (LLMs) face severe memory bottlenecks during inference due to the ever-growing key-value (KV) cache, especially in long-context settings. While recent low-rank compression techniques mitigate this issue, their reliance on static projection bases leads to suboptimal generalization across diverse prompts—ultimately compromising model performance. We introduce StreamingKV, an adaptive compression framework that dynamically updates the low-rank projection bases during inference, inspired by the Generalized Hebbian Algorithm (GHA). Unlike static methods, StreamingKV tailors the projection subspace to each input prompt in real time, significantly enhancing representation quality with minimal computational overhead. Extensive experiments across multiple model families on long-context tasks demonstrate that StreamingKV consistently improves accuracy at the same compression ratio with negligible latency increase.
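To make the GHA-based adaptation concrete, here is a minimal NumPy sketch of the classical Generalized Hebbian Algorithm (Sanger's rule) tracking the top principal directions of a stream of vectors. This is an illustration of the underlying update rule only, not the paper's actual implementation; the function name `gha_update`, the toy dimensions, and the learning rate are all hypothetical.

```python
import numpy as np

def gha_update(W, x, eta=0.01):
    """One Generalized Hebbian Algorithm (Sanger's rule) step.

    W: (r, d) current low-rank projection basis (rows approach principal directions).
    x: (d,) incoming vector (e.g., a newly generated key vector).
    """
    y = W @ x  # project the new vector onto the current basis
    # Sanger's rule: Hebbian term minus a lower-triangular deflation term,
    # which keeps the rows orthonormal and ordered by variance explained.
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Toy demo: recover the top principal directions of streaming data.
rng = np.random.default_rng(0)
d, r = 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # known eigenbasis
lam = np.array([5.0, 3.0] + [0.1] * (d - 2))      # eigenvalues with a clear gap
W = 0.1 * rng.standard_normal((r, d))
for _ in range(5000):
    x = Q @ (np.sqrt(lam) * rng.standard_normal(d))
    W = gha_update(W, x)

# Alignment of the first learned row with the true top direction (close to 1).
align = abs(W[0] @ Q[:, 0]) / np.linalg.norm(W[0])
```

In a KV-cache setting, each incoming key vector would play the role of `x`, so the projection subspace drifts toward the dominant directions of the current prompt as decoding proceeds.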
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11368