Keywords: Long-context large language model, KV cache eviction
TL;DR: We propose HS-SFT, a hybrid sparse SFT method that effectively improves the performance of offline KV cache eviction for LLMs.
Abstract: Long-context LLMs are constrained by the linear growth of key–value (KV) caches during autoregressive decoding, which incurs pronounced latency and memory overhead. KV eviction mitigates this issue, with existing efforts falling into offline policies with fixed eviction patterns and online policies that adaptively discard cache entries based on attention scores. While online eviction typically preserves accuracy on standard benchmarks, its performance can collapse in practical multi-turn dialogue scenarios where query positions vary, and its integration with pre-fill acceleration remains challenging. In contrast, offline eviction is infrastructure-friendly and generalizable but commonly sacrifices more accuracy. In this paper, we explore Supervised Fine-Tuning (SFT) for offline KV eviction and demonstrate its efficacy as a simple and powerful alternative to designing complex online eviction metrics. We further propose Hybrid Sparse Supervised Fine-Tuning (HS-SFT) to search for an optimal offline KV-eviction design within SFT. In particular, HS-SFT employs a straight-through estimator to learn discrete local-window allocations for streaming heads across layers, together with a budget-aware balancing loss, so that under high compression ratios, where dense-head capacity is constrained, the budget can be skewed more effectively toward critical information. Across extensive evaluations on a wide array of LLMs and long-context tasks, HS-SFT delivers substantial performance gains over state-of-the-art eviction baselines. For example, with fewer than 4 hours of SFT on LLaMA-3-8B-1048K using a single 8-GPU node, HS-SFT achieves 5.86% and 38.3% higher average accuracy than Duo-attn on Longbench and Ruler-16K at a 10% KV budget, respectively. These results position training-aware offline eviction, achieved with simple SFT, as an effective and practical path to scalable long-context inference. Code will be available.
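For illustration only (the paper's code is not yet released), below is a minimal PyTorch-style sketch of the straight-through trick for learning discrete per-head local-window allocations with a budget-aware balancing penalty. All names, shapes, and the exact form of the balancing loss are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch: straight-through estimator (STE) for per-head
# local-window allocation with a budget-aware balancing penalty.
# Names (logits, window_sizes, budget_tokens) are illustrative assumptions.
import torch
import torch.nn.functional as F

def allocate_windows(logits: torch.Tensor, window_sizes: torch.Tensor):
    """logits: [num_layers, num_heads, num_choices] learnable allocation scores.
    window_sizes: [num_choices] candidate local-window lengths (in tokens)."""
    probs = F.softmax(logits, dim=-1)                                 # soft allocation
    hard = F.one_hot(probs.argmax(dim=-1), probs.size(-1)).float()    # discrete choice
    # Straight-through: forward pass uses the hard one-hot selection,
    # backward pass routes gradients through the soft probabilities.
    one_hot = hard + probs - probs.detach()
    # Expected window length per head, used by the budget penalty below.
    expected_len = (probs * window_sizes).sum(dim=-1)                 # [layers, heads]
    return one_hot, expected_len

def budget_balancing_loss(expected_len: torch.Tensor, budget_tokens: float):
    # Penalize deviation of the mean allocated window from the target KV budget,
    # so capacity can be skewed toward heads that need it without exceeding the cap.
    return (expected_len.mean() - budget_tokens).pow(2)
```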
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4937