MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

Published: 01 Jun 2026, Last Modified: 10 Jun 2026 | AdaptFM Poster | Everyone | CC BY 4.0
Keywords: Diffusion LLM, Sparse Attention
TL;DR: A sparse attention method for long-context inference with block diffusion LLMs.
Abstract: Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. This motivates sparse attention, which attends only to a small KV subset per query. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show that this per-block constraint reduces the recall of existing sparse KV estimators by up to 25%. We address this by exploiting a property of the block-diffusion training objective: it aligns the block-average query across denoising steps, so the All-[MASK] block at the first step already reveals the per-block KV subset for the entire trajectory. Building on this, MAGE ([MASK]-Guided Sparse Attention) is a training-free method that runs one exact attention pass at the first step and reuses its top-k index sets for all remaining steps within the block. Across three block-diffusion families on LongBench, MAGE is near-lossless relative to Exact Attention at k=512, achieves up to 6.82× end-to-end speedup at 128K context, and runs up to 3.35× and 2.28× faster than Quest and SparseD, respectively.
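The select-once, reuse-later pattern described in the abstract can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the function names (`select_block_topk`, `sparse_attention`, `exact_attention`), the single-head layout, and in particular the choice to pool attention probabilities by averaging over the B tokens of the block are assumptions made for illustration only; the abstract does not specify how per-token scores are combined into one shared index set.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def exact_attention(q_block, k_cache, v_cache):
    # Dense attention of all B block queries over all N cached keys.
    d = q_block.shape[-1]
    probs = softmax(q_block @ k_cache.T / np.sqrt(d), axis=-1)  # (B, N)
    return probs @ v_cache  # (B, d_v)


def select_block_topk(q_block, k_cache, k):
    # First denoising step (hypothetical helper): score every cached key
    # against the block's queries with exact attention, pool the scores
    # over the B tokens, and keep one shared top-k index set.
    # ASSUMPTION: mean pooling over the block; the source does not say.
    d = q_block.shape[-1]
    probs = softmax(q_block @ k_cache.T / np.sqrt(d), axis=-1)  # (B, N)
    pooled = probs.mean(axis=0)  # (N,)
    k = min(k, pooled.shape[0])
    idx = np.argpartition(pooled, -k)[-k:]  # unordered top-k positions
    return np.sort(idx)  # sorted for contiguous-ish gathers


def sparse_attention(q_block, k_cache, v_cache, idx):
    # Later denoising steps: attend only to the KV rows chosen at step one.
    d = q_block.shape[-1]
    k_sel, v_sel = k_cache[idx], v_cache[idx]
    probs = softmax(q_block @ k_sel.T / np.sqrt(d), axis=-1)  # (B, k)
    return probs @ v_sel


# Toy usage: B=4 block tokens, N=64 cached positions, head dim 16.
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 16))
q_step0 = rng.standard_normal((4, 16))  # stands in for the All-[MASK] block
idx = select_block_topk(q_step0, K, k=8)  # one exact pass, at step one only
for _ in range(3):  # remaining denoising steps reuse the same indices
    q_step = q_step0 + 0.1 * rng.standard_normal((4, 16))
    out = sparse_attention(q_step, K, V, idx)
```

In this sketch the cost of scoring all N keys is paid once per block, and every later step gathers only k rows of the KV cache, which is where the memory-access saving would come from. When k equals N the sparse path reduces to exact attention, which is a convenient sanity check.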
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 185