DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

18 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Generation, Efficient Video Generation, Sparse Attention
Abstract: Video generation models based on diffusion transformers have recently attracted widespread attention for their excellent generation quality. Despite this progress, their computational cost remains the principal bottleneck: attention alone accounts for more than 80\% of the overall latency, and synthesizing even an 8-second 720p video takes tens of minutes, which severely restricts practical applicability and scalability. To address this, we propose **DraftAttention**, a training-free framework that accelerates video diffusion transformers with dynamic sparse attention on GPUs. The key idea is to compute a low-resolution draft attention over downsampled queries and keys at minor computational overhead. The draft attention exposes redundancy both spatially within each feature map and temporally across frames, thereby identifying the most important regions in the attention map. The resulting low-resolution sparse mask then guides the full-resolution sparse attention computation. To align region-level sparsity with token-level computation, we further propose a deterministic token reordering that makes the entries of each region contiguous in memory, ensuring hardware-friendly execution of sparse attention. Our theoretical analysis shows that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to a 2x end-to-end speedup on GPUs.
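To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of draft-guided sparse attention. The function name `draft_sparse_attention`, the `region_size` and `keep_ratio` parameters, and the use of dense masked attention in place of a true reordered block-sparse kernel are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch

def draft_sparse_attention(q, k, v, region_size=64, keep_ratio=0.25):
    """Illustrative sketch: low-resolution draft attention guiding sparse attention.

    q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by region_size.
    """
    b, h, n, d = q.shape
    r = n // region_size  # number of token regions

    # 1. Downsample queries and keys by average-pooling each region of tokens.
    q_low = q.view(b, h, r, region_size, d).mean(dim=3)  # (b, h, r, d)
    k_low = k.view(b, h, r, region_size, d).mean(dim=3)

    # 2. Draft attention at region resolution: an (r x r) map instead of (n x n).
    draft = torch.softmax(q_low @ k_low.transpose(-1, -2) / d ** 0.5, dim=-1)

    # 3. Keep only the highest-weight key regions for each query region.
    keep = max(1, int(keep_ratio * r))
    topk = draft.topk(keep, dim=-1).indices
    region_mask = torch.zeros_like(draft, dtype=torch.bool).scatter_(-1, topk, True)

    # 4. Expand the region-level mask to token level. (A real kernel would instead
    #    reorder tokens so each region is contiguous and skip masked blocks entirely.)
    token_mask = region_mask.repeat_interleave(region_size, dim=-2)
    token_mask = token_mask.repeat_interleave(region_size, dim=-1)  # (b, h, n, n)

    # 5. Full-resolution attention restricted to the selected regions.
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Tiny usage example with random tensors.
q = torch.randn(1, 2, 256, 64)
k = torch.randn(1, 2, 256, 64)
v = torch.randn(1, 2, 256, 64)
out = draft_sparse_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 256, 64])
```

In this sketch the saving comes from the draft step being only (r x r); the final masked dense attention is for clarity only, whereas the claimed speedup relies on a hardware-friendly block-sparse kernel operating on reordered, contiguous regions.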
Supplementary Material: zip
Primary Area: generative models
Submission Number: 12174