SyncKV: A Syncopated Scheduling Approach to KV Cache Compression for Efficient Long-Context LLM Inference

16 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: NLP, Large Language Models
TL;DR: We propose SyncKV, which compresses the KV cache by dynamically recalling important tokens, thereby reducing GPU memory usage and accelerating long-context inference.
Abstract: KV caching accelerates the inference of Large Language Models (LLMs) by storing the key and value states of previous tokens, but its linearly growing memory footprint becomes a major bottleneck for long-context tasks. To mitigate this, many prior studies evict unimportant tokens based on attention scores from the prefill stage or on cumulative attention. However, by permanently evicting tokens, such static compression algorithms fail to preserve globally important tokens, because they overlook the "attention drift" phenomenon inherent in generation. Our analysis highlights this drift: after generating just 50 tokens, the set of important tokens retains only about 30% overlap with the set identified during the prefill stage. To address this, our core insight is twofold: (1) the set of important tokens exhibits high temporal locality across adjacent generation steps, and (2) this set is highly similar across attention heads within the same layer. Based on these insights, we propose SyncKV, a training-free dynamic KV cache compression method. SyncKV exploits these properties through a novel syncopated strategy in which a few "representative heads" periodically identify important tokens, triggering an asynchronous upload of the relevant KV cache from the CPU. We design a parallelization strategy that overlaps this I/O with the subsequent forward computation, effectively hiding the data-transfer latency and yielding end-to-end speedups. Experiments show that SyncKV achieves state-of-the-art performance on multiple long-context benchmarks while reducing the GPU memory usage of the KV cache by up to 80%. Our code will be open-sourced.
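The sketch below is a minimal, hypothetical illustration of the scheduling idea described in the abstract; it is not the authors' implementation. The class name `SyncKVCacheSketch` and parameters such as `refresh_every`, `budget`, and `representative_heads` are assumptions introduced only for illustration. It shows, for a single layer, how a few representative heads could periodically re-score tokens and how the selected KV entries could be copied from CPU to GPU on a side CUDA stream so the transfer overlaps subsequent computation.

```python
# Hypothetical sketch of the syncopated recall idea (not the authors' code):
# a few "representative heads" periodically re-select important tokens, and
# the corresponding KV entries are prefetched from pinned CPU memory on a
# side CUDA stream so the copy overlaps the next forward step.
import torch


class SyncKVCacheSketch:
    def __init__(self, full_k_cpu, full_v_cpu, budget, refresh_every=16,
                 representative_heads=(0, 1)):
        # Full per-layer KV cache kept in pinned CPU memory: [heads, seq, dim].
        self.k_cpu = full_k_cpu.pin_memory()
        self.v_cpu = full_v_cpu.pin_memory()
        self.budget = budget                    # number of tokens kept on GPU
        self.refresh_every = refresh_every      # recall period, in decode steps
        self.rep_heads = list(representative_heads)
        self.copy_stream = torch.cuda.Stream()  # side stream for async prefetch
        # Pinned staging buffers so the CPU->GPU copy can be truly asynchronous.
        h, _, d = self.k_cpu.shape
        self.k_stage = torch.empty(h, budget, d, pin_memory=True)
        self.v_stage = torch.empty(h, budget, d, pin_memory=True)
        self.k_gpu = self.v_gpu = None
        self.step = 0

    def _select_tokens(self, query_gpu):
        # Score all cached tokens using only the representative heads.
        # query_gpu: [heads, dim]; assumes seq_len >= budget.
        q = query_gpu[self.rep_heads].cpu()                    # few heads -> cheap
        k = self.k_cpu[self.rep_heads]                         # [r, seq, dim]
        scores = torch.einsum("rd,rsd->rs", q, k).mean(dim=0)  # [seq]
        idx = torch.topk(scores, self.budget).indices
        return idx.sort().values

    def maybe_recall(self, query_gpu):
        # Every `refresh_every` steps, re-select important tokens and launch an
        # asynchronous CPU->GPU copy that overlaps with subsequent compute.
        if self.step % self.refresh_every == 0 or self.k_gpu is None:
            idx = self._select_tokens(query_gpu)
            torch.index_select(self.k_cpu, 1, idx, out=self.k_stage)
            torch.index_select(self.v_cpu, 1, idx, out=self.v_stage)
            with torch.cuda.stream(self.copy_stream):
                self.k_gpu = self.k_stage.to("cuda", non_blocking=True)
                self.v_gpu = self.v_stage.to("cuda", non_blocking=True)
        self.step += 1

    def get(self):
        # Wait for the prefetch only when the compressed cache is needed.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        return self.k_gpu, self.v_gpu
```

In this sketch, attention for each decode step would run against the small GPU-resident subset returned by `get()`, while `maybe_recall()` is issued early in the step so the copy hides behind the layer's forward computation; how SyncKV actually schedules and overlaps these operations is specified in the paper, not here.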
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7133