Training-Free Native Sparse Attention for KV Cache Compression

ICLR 2026 Conference Submission 16673 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, KV Cache Compression, Training-Free
TL;DR: A training-free, hierarchical block-wise KV cache compression method that dramatically reduces memory and speeds up inference for LLMs, while maintaining high accuracy and compatibility with existing frameworks.
Abstract: Large language models (LLMs) suffer from inference inefficiency as KV cache memory and computation scale linearly with context length. Existing KV cache compression methods typically rely on attention-score-based token-level selection, which leads to uneven attention distributions—overemphasizing prompt boundaries and neglecting global context. We propose a novel training-free hierarchical block-wise KV cache compression method with two key innovations: (1) block-wise selection that achieves superior precision over token-level approaches, and (2) a hierarchical selection strategy that preserves global context without extra training. Our approach adapts insights from Native Sparse Attention to the KV cache compression setting, enabling plug-and-play integration into existing pre-trained models. Extensive experiments demonstrate significant improvements: a 16× compression ratio on 32K sequences, reducing the KV cache by over 90%, accelerating decoding by 4×, and maintaining over 99% accuracy. Our training-free solution offers universal compatibility with existing LLM frameworks for practical long-context applications.
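To make the core idea concrete, below is a minimal illustrative sketch of block-wise KV selection in the spirit the abstract describes: cached keys are grouped into fixed-size blocks, each block is scored by the mean attention it receives from a recent query window, and only the top-scoring blocks are retained. This is not the authors' code; `block_size`, `keep_blocks`, and the scoring rule are hypothetical choices, and the paper's hierarchical selection stage is omitted.

```python
import torch

def blockwise_kv_select(keys, values, queries, block_size=64, keep_blocks=8):
    """keys/values: [seq_len, d]; queries: [num_recent_queries, d].
    Returns the KV entries belonging to the highest-scoring blocks."""
    seq_len, d = keys.shape
    num_blocks = (seq_len + block_size - 1) // block_size

    # Attention of recent queries over all cached keys, averaged per token.
    scores = (queries @ keys.T) / d ** 0.5            # [q, seq_len]
    token_weight = scores.softmax(dim=-1).mean(0)     # [seq_len]

    # Aggregate token weights into per-block importance scores.
    pad = num_blocks * block_size - seq_len
    block_weight = torch.nn.functional.pad(token_weight, (0, pad)) \
                        .view(num_blocks, block_size).mean(-1)  # [num_blocks]

    # Keep the highest-scoring blocks and gather their token indices.
    top = block_weight.topk(min(keep_blocks, num_blocks)).indices.sort().values
    token_idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in top
    ])
    return keys[token_idx], values[token_idx]
```

Selecting whole blocks rather than individual tokens keeps contiguous context spans together, which is the intuition behind the claimed precision advantage over token-level selection.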
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16673