SentKVCompress: Sentence-Level Dynamic KVCache Compression for Efficient Long-Context LLM Inference

15 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: KVCache; Sentence-level; LLM; Long Context
Abstract: The demand for million-token-scale contexts in the agent era has expanded the KVCache to terabyte levels, creating severe inference bottlenecks due to high storage overhead and frequent memory access. Existing KV compression methods struggle to relieve this memory pressure efficiently, facing accuracy or efficiency challenges in both KVCache storage and usage. Challenge 1: At the KV storage level, KV preprocessing mechanisms face an accuracy-efficiency trade-off: lossy methods suffer a significant end-to-end accuracy drop of over 30%, while near-lossless methods incur substantial overhead, causing inference time to grow superlinearly with context length. Challenge 2: At the usage level, KV selection mechanisms face a similar dilemma. Static selection (e.g., attention sinks) fails to capture semantic relationships, yielding low recall (<50% for top-10 tokens); dynamic selection (e.g., online score calculation) incurs prohibitive overhead, consuming over 60% of KV selection GPU time and 70% of CPU-GPU memory bandwidth through redundant transfers. Our core insight is that these challenges arise because existing methods follow an unstructured, token-level compression paradigm. This focus on discrete tokens, which inherently lack semantic structure, forces the model to expend substantial additional computation to implicitly re-extract structural information from long texts during inference. To address this, we observe that attention scores aggregate naturally at the sentence level. Leveraging this finding, we propose SentKVCompress, a novel sentence-level dynamic KVCache management framework that explicitly extracts and exploits this inherent structural information. At the KVCache storage level, to address accuracy loss and high preprocessing overhead, we propose a sentence-level-aware KVCache preprocessing framework that maintains accuracy while cutting overhead to below 20%.
At the KVCache usage level, to address imprecise selection and high additional overhead, we propose a sentence-semantic-driven KVCache selection strategy, enabling 70% of KVs to be reused. Experiments show that SentKVCompress achieves a maximum speedup of 4.2× with nearly no accuracy loss and reduces peak memory by 2.7× in long-context scenarios, while also achieving the highest accuracy at equivalent KV usage rates. The code will be open-sourced at https://github.com/Indexleaf475/ICLR26-SentKVCompress
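The key observation, that attention scores aggregate naturally at the sentence level, suggests a simple selection scheme: score each sentence by the attention mass its tokens receive, then retain the KV entries of the highest-scoring sentences up to a token budget. The sketch below illustrates that idea only; it is not the authors' implementation, and all function names, the mean-pooling choice, and the 70% budget are assumptions for illustration.

```python
# Illustrative sketch of sentence-level KV selection (hypothetical,
# not the paper's actual algorithm): pool token attention scores per
# sentence, then keep tokens of the best sentences within a budget.
import numpy as np

def sentence_scores(token_scores, sent_bounds):
    """Mean attention score per sentence.

    token_scores: (seq_len,) attention mass each cached token received.
    sent_bounds: list of (start, end) token-index pairs, one per sentence.
    """
    return np.array([token_scores[s:e].mean() for s, e in sent_bounds])

def select_kv(token_scores, sent_bounds, budget_frac=0.7):
    """Return sorted token indices whose KV entries are retained.

    Greedily keeps whole sentences, best-scoring first, until adding the
    next sentence would exceed budget_frac of the total token count.
    """
    scores = sentence_scores(token_scores, sent_bounds)
    order = np.argsort(scores)[::-1]           # sentence ids, best first
    budget = int(budget_frac * len(token_scores))
    kept = []
    for i in order:
        s, e = sent_bounds[i]
        if len(kept) + (e - s) > budget:
            break                              # budget exhausted
        kept.extend(range(s, e))
    return sorted(kept)
```

For example, with three sentences where the middle one draws the most attention, the selector drops the lowest-scoring sentence once the budget is reached, so entire low-salience sentences are evicted as a unit rather than scattering evictions across tokens.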
Primary Area: generative models
Submission Number: 5598