SparseD: Sparse Attention for Diffusion Language Models

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Diffusion Language Models, Sparse Attention
Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck stems mainly from attention's quadratic complexity in context length, since all query–key pairs are computed. Intuitively, restricting computation to sparse attention patterns that retain only the most important query–key pairs offers an effective way to reduce this complexity. Such methods are widely used in ARs, where the attention mechanism exhibits clear and fixed sparse patterns. Our analysis reveals that sparse patterns are also present in DLMs, and further highlights three unique observations: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation quality. These findings render the well-studied fixed sparse attention methods from ARs largely incompatible with DLMs: their fixed patterns fail to capture head-specific patterns in DLMs, and applying sparse attention in the early steps degrades generation. To address these challenges, we propose **SparseD**, a novel sparse attention method for DLMs. Leveraging these observations, SparseD pre-computes and selects the most important query–key pairs only once, as head-specific sparse patterns that are reused across denoising steps. This approach captures head-specific patterns without the high latency of recomputing sparse patterns at every denoising step. Meanwhile, SparseD skips sparse attention and uses full attention in the early steps to preserve generation quality. Together, these design choices establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications.
Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps. Anonymous code is available at https://anonymous.4open.science/r/SparseD-8C76/.
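The scheme described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function names, the `keep_ratio` and `warmup_steps` values, and the simplification of holding the query/key/value tensors fixed across denoising steps are all assumptions for demonstration. It shows the three key ideas: full attention in early steps, a one-time selection of the top query–key pairs per head, and reuse of those head-specific masks in all later steps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_head_masks(scores, keep_ratio):
    """Per head, keep the top `keep_ratio` fraction of query-key pairs.
    `scores` has shape (heads, L, L); returns a boolean mask of the same shape."""
    H, L, _ = scores.shape
    k = max(1, int(keep_ratio * L * L))
    masks = np.zeros_like(scores, dtype=bool)
    for h in range(H):
        top = np.argpartition(scores[h].ravel(), -k)[-k:]
        masks[h].ravel()[top] = True  # ravel() of a contiguous slice is a view
    return masks

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; masked-out pairs get a large negative score."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def denoise(q, k, v, num_steps=6, warmup_steps=2, keep_ratio=0.25):
    """Toy denoising loop: full attention for the first `warmup_steps` steps,
    then compute head-specific sparse masks ONCE and reuse them afterwards.
    (A real DLM would refresh q/k/v each step; we hold them fixed here.)"""
    masks, outputs = None, []
    for step in range(num_steps):
        if step < warmup_steps:
            out = attention(q, k, v)              # preserve early-step quality
        else:
            if masks is None:                     # one-time pattern selection
                scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
                masks = select_head_masks(scores, keep_ratio)
            out = attention(q, k, v, masks)       # reuse head-specific masks
        outputs.append(out)
    return outputs

rng = np.random.default_rng(0)
H, L, d = 2, 8, 4
q, k, v = (rng.standard_normal((H, L, d)) for _ in range(3))
outs = denoise(q, k, v)
```

Because the masks are selected once and the inputs are fixed in this sketch, every post-warmup step produces an identical output, which makes the reuse explicit; in practice only the sparse *pattern* is reused while the token states evolve.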
Supplementary Material: zip
Primary Area: generative models
Submission Number: 2226