LoSA: Locality Aware Sparse Attention in Diffusion Language Models

Haocheng Xi; Harman Singh; Yuezhou Hu; Coleman Richard Charles Hooper; Rishabh Tiwari; Aditya Tomar; Minjae Lee; Wonjun Kang; Michael W. Mahoney; Chenfeng Xu; Kurt Keutzer; Amir Gholami

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

Readers: Everyone

License: CC BY 4.0

TL;DR: We propose a sparse attention method for diffusion language models for long context scenarios.

Abstract: Block-wise diffusion language models (DLMs) generate multiple tokens in parallel, offering a promising alternative to autoregressive decoding. However, their inference efficiency remains bottlenecked by memory-bound attention in long-context scenarios. Naïve sparse attention is ineffective for DLMs due to the KV inflation problem: different queries select different prefix positions, causing the union of accessed KV pages to remain large. To address this challenge, we observe that block-wise diffusion exhibits locality of representation changes across denoising steps: only a small fraction of tokens (active tokens) undergo significant hidden-state updates, while most tokens (stable tokens) remain nearly unchanged. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens with large representation changes. This design reduces the number of queries contributing to the union of KV indices, substantially shrinking the KV pages that must be loaded. Across multiple block-wise DLMs and reasoning benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to 4.14× speedup over dense attention on RTX A6000 GPUs. LoSA also delivers up to 5% average improvement over baselines across all datasets and configurations, demonstrating the effectiveness of the proposed method.
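The KV inflation argument in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the relative L2 change metric, the `threshold` value, and the per-query top-k page selection are all assumptions chosen for illustration. The sketch only shows why restricting sparse selection to active tokens shrinks the union of KV pages; it does not model the reuse of cached prefix-attention results for stable tokens.

```python
import math

def l2_change(prev, curr):
    """Relative L2 change between two hidden-state vectors (assumed metric)."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    norm = math.sqrt(sum(p * p for p in prev)) or 1.0
    return diff / norm

def split_active_stable(prev_hidden, curr_hidden, threshold=0.1):
    """Partition token indices by how much their hidden state moved.

    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    active, stable = [], []
    for i, (p, c) in enumerate(zip(prev_hidden, curr_hidden)):
        (active if l2_change(p, c) > threshold else stable).append(i)
    return active, stable

def topk_pages(page_scores, k):
    """Indices of the k highest-scoring prefix KV pages for one query."""
    order = sorted(range(len(page_scores)), key=lambda j: -page_scores[j])
    return set(order[:k])

def pages_to_load(scores_per_query, query_ids, k):
    """Union of the top-k pages over the given queries."""
    pages = set()
    for q in query_ids:
        pages |= topk_pages(scores_per_query[q], k)
    return pages

# Toy block: 4 tokens, 8 prefix KV pages, hidden states at two denoising steps.
prev_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
curr_hidden = [[1.0, 0.01], [0.9, 0.2], [1.0, 1.02], [2.0, 0.0]]

# Each query prefers a different pair of pages, so the union inflates.
scores = [
    [9, 8, 0, 0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 9, 8, 0, 0],
    [0, 0, 0, 0, 0, 0, 9, 8],
]

active, stable = split_active_stable(prev_hidden, curr_hidden)
naive = pages_to_load(scores, range(len(scores)), k=2)   # every query selects
active_only = pages_to_load(scores, active, k=2)         # only active queries select

print("active:", active, "stable:", stable)
print("pages loaded (all queries):", len(naive))
print("pages loaded (active only):", len(active_only))
```

In this toy setup each query keeps only 2 of 8 pages, yet the union over all four queries still covers every page. Selecting for the single active token alone brings the union back down to 2 pages.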

Lay Summary: Large language models usually write text one word at a time, which can be slow. Block-wise diffusion language models offer a different approach: they revise a whole group of words together, more like editing a sentence draft than typing it from left to right. This parallel editing could make generation faster, but the model still spends much of its time looking back over long context, which requires moving large amounts of memory. A common shortcut is to make each word look at only a small part of the context, but in block-wise diffusion different words often choose different parts, so the system still has to load nearly as much memory as before. We introduce LoSA, a method that takes advantage of how these models change during editing. Most words barely change from one refinement step to the next, while only a few are actively being revised. LoSA reuses previous attention results for the stable words and performs the expensive context lookup only for the active ones. This reduces memory movement while keeping the model’s answers close to those of full attention, enabling faster long-context reasoning with little or no loss in accuracy.

Primary Area: Deep Learning->Algorithms

Keywords: sparse attention, diffusion language models, long context

Submission Number: 20137
