FG-ATTN: LEVERAGING FINE-GRAINED SPARSITY IN DIFFUSION TRANSFORMERS

ICLR 2026 Conference Submission 20791 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Attention, Sparsity, Sparse Attention, Diffusion Transformers
Abstract: Generating realistic videos/images with diffusion transformers requires evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. However, existing methods rely on block-sparse attention, which skips attention computation only when all scores within a coarse M×M tile (typically 64×64) are expected to be negligible. This coarse-grained skipping leaves a large fraction of redundant computation unaddressed. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity. Exploiting this efficiently on modern GPUs is challenging: fine-grained skipping introduces irregular memory access, can reduce tensor core utilization, and makes it difficult to determine which computations to skip without loss in accuracy. We propose FG-Attn, a novel fine-grained sparse attention mechanism that skips score computations at the granularity of M×1 slices, where each slice is the result of query-key dot products between M query vectors and a single key. We introduce a highly efficient asynchronous gather-load primitive that loads only the sparse set of key/value vectors into tensor-core-compatible tiles in on-chip GPU shared memory, hiding the overhead of irregular memory access. We develop two training-free, lightweight prediction strategies that identify redundant scores to skip with negligible overhead. FG-Attn can fully supersede existing block-sparsity methods in DiTs, and we demonstrate that it achieves up to 1.65× speedup (1.48× on average) for state-of-the-art video models on an H100 GPU.
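The sketch below is a minimal NumPy reference of the M×1 slice-level skipping idea described in the abstract: for each tile of M queries, a cheap proxy decides which individual keys to keep, only those keys/values are gathered, and the softmax runs over the gathered subset. The function name `fg_attention_reference`, the mean-query proxy, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual predictors or kernel; the real speedups depend on the fused GPU kernel with the asynchronous gather-load into shared memory.

```python
import numpy as np

def fg_attention_reference(Q, K, V, M=64, keep_ratio=0.25):
    """Illustrative M x 1 slice-level sparse attention (not the paper's kernel).

    For each tile of M consecutive queries, a lightweight proxy score ranks
    keys; only the top fraction of keys/values is gathered and used in the
    softmax, mimicking fine-grained skipping at a NumPy level.
    """
    N, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for start in range(0, N, M):
        q_tile = Q[start:start + M]                       # (M, d) query tile
        # Hypothetical predictor: score keys against the tile's mean query.
        proxy = np.abs(q_tile.mean(axis=0) @ K.T)         # (N,)
        n_keep = max(1, int(keep_ratio * N))
        keep = np.argsort(proxy)[-n_keep:]                # indices of kept keys
        k_sel, v_sel = K[keep], V[keep]                   # gather sparse keys/values
        scores = (q_tile @ k_sel.T) * scale               # (M, n_keep) kept slices only
        scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:start + M] = probs @ v_sel
    return out

# Tiny smoke test: compare shapes against dense inputs.
rng = np.random.default_rng(0)
N, d = 256, 64
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
approx = fg_attention_reference(Q, K, V, M=64, keep_ratio=0.5)
```

In this toy version the gather is an ordinary fancy-index on the host; on a GPU the same gather must be overlapped with compute (the asynchronous gather-load primitive) so that irregular memory access does not stall the tensor cores.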
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 20791