SparCas: A Dimension-First Cascade for Efficient Long-Context LLM Inference

ICLR 2026 Conference Submission 14902 Authors

19 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Inference, KV Cache, Sparsity
TL;DR: SparCas cascades sparsity from dimensions to tokens for efficient KV cache selection, reducing memory traffic while preserving dense-attention accuracy in long-context LLMs.
Abstract: Large language models (LLMs) have demonstrated strong capability in handling long-context sequences, but inference efficiency is bottlenecked by the continuously growing KV cache. KV cache selection methods mitigate this by retaining only "important" tokens for attention, yet existing solutions face a fundamental dilemma: they rely either on coarse page-level heuristics (sacrificing precision) or on expensive token-wise approximation or scanning (sacrificing speed). We introduce Sparsity Cascade (SparCas), a novel dimension-first cascade that resolves this trade-off. SparCas is grounded in the empirical discovery that token importance rankings are remarkably stable under dimension pruning. Leveraging this, we instantiate a prune-in-prune cascade: (i) intra-token sparsity first prunes the feature space to identify critical dimension indices using a lightweight query-only proxy, and (ii) cross-token sparsity then prunes along the context length by using this tiny subset of dimensions to efficiently filter for salient token indices. This approach effectively decouples the cost of ranking from the context length. Across extensive evaluations on PG-19, LongBench, and RULER, SparCas consistently matches or outperforms dense attention and prior baselines, achieving oracle-level accuracy with budgets as small as 1% of tokens at a 32K-token context. Integrated into FlashInfer, SparCas delivers up to 3.01× faster self-attention and 1.64× end-to-end speedups. Our project is anonymously available at https://anonymous.4open.science/r/sparcas/.
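
To make the two-stage cascade concrete, below is a minimal PyTorch sketch of the idea as described in the abstract. It assumes a magnitude-based query-only proxy for ranking dimensions and illustrative names (`sparcas_select`, `dim_budget`, `token_budget`); the authors' actual proxy, kernel, and API in the SparCas/FlashInfer integration may differ.

```python
# Hypothetical sketch of the dimension-first cascade; not the authors' code.
import torch

def sparcas_select(q, k_cache, dim_budget=32, token_budget=256):
    """Pick salient token indices using only a small subset of dimensions.

    q:        (head_dim,)         current query vector for one head
    k_cache:  (seq_len, head_dim) cached keys for that head
    """
    # (i) Intra-token sparsity: rank feature dimensions with a query-only
    #     proxy (here, the largest |q| entries, an assumed stand-in).
    top_dims = torch.topk(q.abs(), dim_budget).indices

    # (ii) Cross-token sparsity: approximate attention scores using only the
    #      selected dimensions, so the scan over the context touches a tiny
    #      slice of the KV cache instead of full head_dim per token.
    approx_scores = k_cache[:, top_dims] @ q[top_dims]

    # Keep the highest-scoring tokens; dense attention is then computed
    # only over this retained subset.
    token_budget = min(token_budget, k_cache.shape[0])
    return torch.topk(approx_scores, token_budget).indices
```

Under this reading, the ranking cost per decoding step scales with `seq_len × dim_budget` rather than `seq_len × head_dim`, which is how the cascade decouples selection cost from the full feature width while keeping per-token granularity.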
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14902