Keywords: Sparse Attention
Abstract: Transformer models have revolutionized generation tasks, but the quadratic complexity of the attention mechanism limits scalability to long input sequences.
Prior work mitigates this cost with sparse attention, but existing methods rely on contiguous memory access patterns that waste computation under their proposed sparsity layouts, limiting practical efficiency.
In this paper, we propose $\textbf{I}$nterleaved $\textbf{N}$on-contiguous $\textbf{T}$oken spa$\textbf{R}$se $\textbf{A}$ttention (INTRA), a token-wise sparse attention framework that supports flexible sparsity by redesigning memory access patterns. INTRA's loading unit is a single token: it gathers potentially non-contiguous tokens from global memory into contiguous space in shared memory. We formalize this design as the $\textbf{ISPD}$ (Intra Sparse Pattern Design) principle, general guidance for constructing sparsity layouts that run efficiently on GPUs.
INTRA achieves competitive performance on both image and language generation tasks, while accelerating attention by more than $3.3\times$.
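The core idea described in the abstract, gathering possibly non-contiguous token rows from global memory into a contiguous shared-memory tile before computing attention scores, can be illustrated with a minimal CUDA sketch. This is not the paper's kernel; all names (`gather_score_kernel`, `HEAD_DIM`, `TILE_TOKENS`) and the strided index pattern are illustrative assumptions.

```cuda
// Hypothetical sketch: one block gathers a set of possibly non-contiguous
// key tokens into contiguous shared memory, then scores them against a query.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int HEAD_DIM    = 64;   // channels per token (assumed)
constexpr int TILE_TOKENS = 32;   // tokens gathered per block (assumed)

__global__ void gather_score_kernel(const float* __restrict__ keys,      // [num_tokens, HEAD_DIM]
                                    const float* __restrict__ query,     // [HEAD_DIM]
                                    const int*   __restrict__ token_idx, // [TILE_TOKENS] selected token ids
                                    float*       __restrict__ scores)    // [TILE_TOKENS]
{
    // Contiguous staging buffer in shared memory for the gathered tokens.
    __shared__ float k_tile[TILE_TOKENS][HEAD_DIM];

    // Each thread copies one element; consecutive threads read consecutive
    // channels of the same token row, so global loads stay coalesced even
    // though the selected token rows themselves are scattered.
    for (int e = threadIdx.x; e < TILE_TOKENS * HEAD_DIM; e += blockDim.x) {
        int t = e / HEAD_DIM;   // which gathered token
        int d = e % HEAD_DIM;   // which channel
        k_tile[t][d] = keys[token_idx[t] * HEAD_DIM + d];
    }
    __syncthreads();

    // One thread per gathered token computes q . k from the contiguous tile.
    for (int t = threadIdx.x; t < TILE_TOKENS; t += blockDim.x) {
        float acc = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc += query[d] * k_tile[t][d];
        scores[t] = acc;
    }
}

int main() {
    const int num_tokens = 1024;
    float *keys, *query, *scores;
    int *token_idx;
    cudaMallocManaged(&keys,      num_tokens * HEAD_DIM * sizeof(float));
    cudaMallocManaged(&query,     HEAD_DIM * sizeof(float));
    cudaMallocManaged(&scores,    TILE_TOKENS * sizeof(float));
    cudaMallocManaged(&token_idx, TILE_TOKENS * sizeof(int));

    for (int i = 0; i < num_tokens * HEAD_DIM; ++i) keys[i] = 0.01f * (i % 100);
    for (int d = 0; d < HEAD_DIM; ++d) query[d] = 1.0f;
    // Select a non-contiguous (strided) set of token ids as a stand-in sparsity layout.
    for (int t = 0; t < TILE_TOKENS; ++t) token_idx[t] = (t * 17) % num_tokens;

    gather_score_kernel<<<1, 128>>>(keys, query, token_idx, scores);
    cudaDeviceSynchronize();
    printf("score[0] = %f\n", scores[0]);
    return 0;
}
```

The point of the staging step is that, once the scattered rows sit contiguously in shared memory, the score computation is identical to the dense case, which is the flexibility the abstract attributes to token-wise loading.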
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 14777