Keywords: Sparse Attention
Abstract: Transformer models have revolutionized generation tasks, but the quadratic complexity of the attention mechanism limits scalability to long input sequences.
Prior work mitigates this cost with sparse attention, but existing methods rely on contiguous memory access patterns that waste computation under their proposed sparsity layouts, limiting practical efficiency.
In this paper, we propose $\textbf{I}$nterleaved $\textbf{N}$on-contiguous $\textbf{T}$oken spa$\textbf{R}$se $\textbf{A}$ttention (INTRA), a token-wise sparse attention framework that supports flexible sparsity by redesigning memory access patterns. INTRA's loading unit is a single token: it gathers potentially non-contiguous tokens from global memory into contiguous space in shared memory. We formalize this design as the $\textbf{ISPD}$ (Intra Sparse Pattern Design) principle, general guidance for constructing sparsity layouts that run efficiently on GPUs.
INTRA achieves competitive performance on both image and language generation tasks, while accelerating attention by more than $3.3\times$.
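The core idea described in the abstract, gathering possibly non-contiguous token rows from global memory into a contiguous shared-memory tile before computing attention scores, can be illustrated with a minimal CUDA sketch. This is not the paper's kernel; all names (`gather_score_kernel`, `HEAD_DIM`, `TILE_TOKENS`) and the strided index pattern are illustrative assumptions.

```cuda
// Hypothetical sketch: one block gathers a set of possibly non-contiguous
// key tokens into contiguous shared memory, then scores them against a query.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int HEAD_DIM    = 64;   // channels per token (assumed)
constexpr int TILE_TOKENS = 32;   // tokens gathered per block (assumed)

__global__ void gather_score_kernel(const float* __restrict__ keys,      // [num_tokens, HEAD_DIM]
                                    const float* __restrict__ query,     // [HEAD_DIM]
                                    const int*   __restrict__ token_idx, // [TILE_TOKENS] selected token ids
                                    float*       __restrict__ scores)    // [TILE_TOKENS]
{
    // Contiguous staging buffer in shared memory for the gathered tokens.
    __shared__ float k_tile[TILE_TOKENS][HEAD_DIM];

    // Each thread copies one element; consecutive threads read consecutive
    // channels of the same token row, so global loads stay coalesced even
    // though the selected token rows themselves are scattered.
    for (int e = threadIdx.x; e < TILE_TOKENS * HEAD_DIM; e += blockDim.x) {
        int t = e / HEAD_DIM;   // which gathered token
        int d = e % HEAD_DIM;   // which channel
        k_tile[t][d] = keys[token_idx[t] * HEAD_DIM + d];
    }
    __syncthreads();

    // One thread per gathered token computes q . k from the contiguous tile.
    for (int t = threadIdx.x; t < TILE_TOKENS; t += blockDim.x) {
        float acc = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc += query[d] * k_tile[t][d];
        scores[t] = acc;
    }
}

int main() {
    const int num_tokens = 1024;
    float *keys, *query, *scores;
    int *token_idx;
    cudaMallocManaged(&keys,      num_tokens * HEAD_DIM * sizeof(float));
    cudaMallocManaged(&query,     HEAD_DIM * sizeof(float));
    cudaMallocManaged(&scores,    TILE_TOKENS * sizeof(float));
    cudaMallocManaged(&token_idx, TILE_TOKENS * sizeof(int));

    for (int i = 0; i < num_tokens * HEAD_DIM; ++i) keys[i] = 0.01f * (i % 100);
    for (int d = 0; d < HEAD_DIM; ++d) query[d] = 1.0f;
    // Select a non-contiguous (strided) set of token ids as a stand-in sparsity layout.
    for (int t = 0; t < TILE_TOKENS; ++t) token_idx[t] = (t * 17) % num_tokens;

    gather_score_kernel<<<1, 128>>>(keys, query, token_idx, scores);
    cudaDeviceSynchronize();
    printf("score[0] = %f\n", scores[0]);
    return 0;
}
```

The point of the staging step is that, once the scattered rows sit contiguously in shared memory, the score computation is identical to the dense case, which is the flexibility the abstract attributes to token-wise loading.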
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 14777