SparseSkeleton: Prefill sparse attention by decomposition

ICLR 2026 Conference Submission 19739 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, Long-context Prefill, Decomposition, Sparse Pattern, Sparse Attention, Dynamic Sparse Attention.
Abstract: Multi-head attention (MHA) and grouped-query attention (GQA) constitute essential architectural components of modern large language models (LLMs). Although attention remains relatively inexpensive for small inputs, its computational cost grows quadratically with input length. In long-context scenarios, such as book-level summarization or code repository analysis, time-to-first-token (TTFT) performance can therefore deteriorate significantly. While various studies have improved prefill-stage performance by exploiting sparsity structure, sparsity can be increased further through structural refinements. In this work, we propose an approximate online decomposition of the attention matrix that dynamically identifies additional sparsity. The attention matrix is decomposed into three components: a slash component, a vertical component, and a horizontal component. Each component requires only linear space, enabling more efficient processing than the full attention matrix. The decomposition is computed from the query and key tokens with a linear-time algorithm. The statistical properties of the decomposition allow the sparse mask to be generated by simply selecting elements that exceed a threshold; the threshold itself can be chosen either to bound the deviation from regular dense attention or to respect a given time budget. We demonstrate that this technique can be applied directly, without retraining, to networks that use standard dense attention mechanisms (MHA, GQA) and RoPE. We show that accuracy is maintained on the ∞Bench and PG-19 benchmarks for LLAMA-3-8B-INSTRUCT-1048K. Furthermore, we observe substantial increases in sparsity and corresponding speedups over previous methods, halving the number of FLOPs relative to the state of the art at one million tokens.
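The abstract does not spell out the estimator, so the sketch below is only one plausible reading of the method: it assumes the approximate attention score at position (i, j) decomposes additively as slash[i - j] + vertical[j] + horizontal[i], and it estimates each component from a small sample of query rows and key columns so the cost stays roughly linear in sequence length. All function names, the sampling scheme, and the additive model are illustrative assumptions, not the authors' algorithm.

```python
# Hedged illustration of a slash / vertical / horizontal decomposition of a causal
# attention map. This is NOT the paper's implementation; it is a minimal numpy sketch
# under the assumptions stated in the lead-in paragraph.
import numpy as np

def estimate_components(q, k, n_sample=64):
    """Rough linear-time estimate of three 1-D components from Q and K (one head).

    Returns (slash, vertical, horizontal):
      slash[o]      -- average attention mass on diagonal offset o = i - j,
      vertical[j]   -- average attention mass received by key column j,
      horizontal[i] -- average attention mass emitted by query row i (over sampled keys).
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    rows = np.linspace(0, n - 1, num=min(n_sample, n), dtype=int)
    cols = np.linspace(0, n - 1, num=min(n_sample, n), dtype=int)

    # Sampled query rows against all keys -> slash and vertical estimates.
    s_rows = np.exp((q[rows] @ k.T) * scale)                  # (m, n)
    s_rows *= rows[:, None] >= np.arange(n)[None, :]          # causal mask
    s_rows /= s_rows.sum(axis=1, keepdims=True) + 1e-9

    vertical = s_rows.mean(axis=0)
    slash = np.zeros(n)
    count = np.zeros(n)
    for r, i in enumerate(rows):
        offsets = i - np.arange(i + 1)                        # offsets i, i-1, ..., 0
        slash[offsets] += s_rows[r, : i + 1]
        count[offsets] += 1
    slash /= np.maximum(count, 1)

    # All query rows against sampled key columns -> horizontal estimate.
    s_cols = np.exp((q @ k[cols].T) * scale)                  # (n, m)
    s_cols *= np.arange(n)[:, None] >= cols[None, :]          # causal mask
    horizontal = s_cols.mean(axis=1)
    return slash, vertical, horizontal

def sparse_index_sets(slash, vertical, horizontal, threshold):
    """Threshold each component: the retained mask is the union of the selected
    diagonals, key columns, and query rows."""
    return (np.nonzero(slash > threshold)[0],
            np.nonzero(vertical > threshold)[0],
            np.nonzero(horizontal > threshold)[0])
```

In this reading, the retained sparse attention pattern is the union of a few diagonals, key columns, and query rows, and the threshold trades accuracy against sparsity, matching the abstract's claim that it can be set to bound the deviation from dense attention or to meet a compute budget.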
Primary Area: generative models
Submission Number: 19739