Keywords: Sparse Attention, Efficient Attention, Efficient LLM
Abstract: Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present **S2O**, which performs early **s**topping for **s**parse attention via **o**nline permutation.
Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget.
As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a **128K** context, S2O reduces single-operator MSE by **3.82$\times$** at matched sparsity, and reduces prefill compute density by **3.31$\times$** at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves **7.51$\times$** attention and **3.81$\times$** end-to-end speedups.
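The early-stopping rule described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `select_blocks`, `block_scores`, and `threshold` are invented names, and the real system operates inside a factorized FlashAttention kernel rather than a Python loop.

```python
# Illustrative sketch of S2O's importance-guided permutation + early stopping
# (all names are hypothetical, not from the paper).
def select_blocks(block_scores, threshold):
    """Return indices of blocks to compute, visited in descending importance."""
    # Online "permutation": order block indices by importance score,
    # so high-priority blocks are loaded and computed first.
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i], reverse=True)
    kept = []
    for i in order:
        # Early stopping: once a score falls below the threshold, every
        # remaining block scores even lower, so skip them all.
        if block_scores[i] < threshold:
            break
        kept.append(i)
    return kept

# Example: blocks 0 and 2 clear the threshold; blocks 1 and 3 are skipped.
print(select_blocks([0.9, 0.1, 0.5, 0.05], threshold=0.2))  # → [0, 2]
```

Because scores are visited in sorted order, a single below-threshold score suffices to terminate the scan, which is what raises effective sparsity under a controlled error budget.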
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 650