HEX: Merging Heavy-Hitters and Expanders for Adaptive KV Cache Optimization in Long-Context Inference

ICLR 2026 Conference Submission25279 Authors

20 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0

Keywords: Large Language Models, Key-Value Caching, Efficient Inference, Memory Optimization, KV Cache Compression, Structural Sparsity, Expander Graphs, Long Context Inference, Heavy-Hitters

TL;DR: HEX combines expander-graph sparsity with dynamic token selection and quantization to compress KV caches, achieving strong accuracy–efficiency trade-offs for long-context inference.

Abstract: Key–Value (KV) caching accelerates large language model inference, but the cache grows linearly with sequence length and quickly exhausts GPU memory. Existing compression strategies such as quantization, pruning, or sparsification shrink this footprint, but often degrade performance. Most pruning methods discard crucial connections and disrupt information flow, while dynamic heuristics often lack a theoretical basis. We propose HEX, a cache compression strategy that is both structurally efficient and adaptive. HEX constructs a sparse backbone using expander graphs with spectral guarantees on connectivity, and augments it with heavy-hitter and recent tokens to capture input-specific context. The selected entries are stored in full precision, while the remaining cache is quantized to retain information at low cost. The expander masks are precomputed and static, which significantly reduces computational overhead and aids sparse implementations. Experiments on GSM8k, CoQA, TruthfulQA, and LongBench across models of varying sizes show that HEX consistently outperforms existing methods at higher compression rates without retraining. These results illustrate how principled eviction layouts grounded in graph structure and input dynamics can yield stronger accuracy–efficiency trade-offs for long-context inference, even under limited cache budgets.
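The selection scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the expander construction, so a fixed random sparse pattern stands in for the precomputed mask, and the function names, the per-row min-max quantizer, and the 4-bit setting are all assumptions made for illustration.

```python
import numpy as np

def select_full_precision(attn_scores, static_mask, n_heavy, n_recent):
    """Mark cache positions kept in full precision (hypothetical helper).

    The result is the union of a precomputed static mask, the n_heavy
    positions with the largest accumulated attention scores, and the
    n_recent most recent positions.
    """
    keep = static_mask.copy()
    if n_heavy > 0:
        keep[np.argsort(attn_scores)[-n_heavy:]] = True
    if n_recent > 0:
        keep[-n_recent:] = True
    return keep

def fake_quantize(x, n_bits=4):
    """Per-row uniform min-max quantization followed by dequantization.

    This only simulates low-precision storage; the quantizer used in the
    paper is not stated in the abstract.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / (2 ** n_bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    return np.round((x - lo) / scale) * scale + lo

def compress_cache(kv, keep, n_bits=4):
    """Quantize every row, then restore the selected rows exactly."""
    out = fake_quantize(kv, n_bits)
    out[keep] = kv[keep]
    return out

# Toy example: 16 cached tokens with 8-dimensional keys.
rng = np.random.default_rng(0)
T, d = 16, 8
keys = rng.standard_normal((T, d))
scores = rng.random(T)  # stand-in for accumulated attention scores

# Assumption: a fixed random sparse pattern replaces the expander mask.
static_mask = np.zeros(T, dtype=bool)
static_mask[rng.choice(T, size=4, replace=False)] = True

keep = select_full_precision(scores, static_mask, n_heavy=2, n_recent=3)
compressed = compress_cache(keys, keep)
```

Because the static mask is fixed ahead of time, only the heavy-hitter and recency terms depend on the input, which is the property the abstract credits for the low selection overhead.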

Primary Area: generative models

Submission Number: 25279
