Keywords: Sparse Attention, Efficient Attention, Efficient LLM
Abstract: Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present **S2O**, which performs early **s**topping for **s**parse attention via **o**nline permutation.
Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget.
As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a **128K** context, S2O reduces single-operator MSE by **3.82$\times$** at matched sparsity, and reduces prefill compute density by **3.31$\times$** at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves **7.51$\times$** attention and **3.81$\times$** end-to-end speedups.
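The early-stopping rule described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `select_blocks`, `block_scores`, and `threshold` are invented names, and the real system operates inside a factorized FlashAttention kernel rather than a Python loop.

```python
# Illustrative sketch of S2O's importance-guided permutation + early stopping
# (all names are hypothetical, not from the paper).
def select_blocks(block_scores, threshold):
    """Return indices of blocks to compute, visited in descending importance."""
    # Online "permutation": order block indices by importance score,
    # so high-priority blocks are loaded and computed first.
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    kept = []
    for i in order:
        # Early stopping: once a score falls below the threshold, every
        # remaining block scores even lower, so skip them all.
        if block_scores[i] < threshold:
            break
        kept.append(i)
    return kept

# Example: blocks 0 and 2 clear the threshold; blocks 1 and 3 are skipped.
print(select_blocks([0.9, 0.1, 0.5, 0.05], threshold=0.2))  # → [0, 2]
```

Because scores are visited in sorted order, a single below-threshold score suffices to terminate the scan, which is what raises effective sparsity under a controlled error budget.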
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 650