TornadoAttention: Hardware-Efficient Sparse Attention via Fine-Grained Spatio-Temporal Permutation for Video Generation

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Sparse Attention, Video Generation
TL;DR: A training-free hardware-efficient sparse attention via fine-grained spatio-temporal permutation for video generation.
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable success in video generation. However, their core component, the self-attention mechanism, suffers from quadratic complexity. Sparse attention mechanisms have been proposed to alleviate this issue, but existing methods often impose strong, handcrafted priors: they restrict attention to a few fixed patterns and thus fail to capture the diverse, data-dependent attention patterns unique to each layer and head. Motivated by the spatio-temporal locality of self-attention in DiTs, we propose TornadoAttention, a \textbf{training-free} sparse attention mechanism. Our key idea is to apply a fine-grained permutation to the query and key sequences so that they better match the underlying attention structure. We apply TornadoAttention to advanced open-source video generation models. Our experiments reveal that the attention masks obtained through offline search generalize well across a diverse range of prompts, which provides a crucial foundation for aggressive, hardware-specific kernel-level optimizations. On the HunyuanVideo model, our method achieves a 1.4$\times$ speedup with negligible loss in fidelity.
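The core idea — permuting tokens so that locally-attending tokens become contiguous, then applying a block-sparse mask — can be illustrated with a minimal NumPy sketch. The permutation and block size below are illustrative stand-ins, not the paper's actual offline-searched masks or kernel implementation:

```python
import numpy as np

def permuted_block_sparse_attention(q, k, v, perm, block=2):
    """Toy sketch: permute tokens into a locality-friendly order,
    restrict attention to block-diagonal blocks, then undo the
    permutation. `perm` and `block` are illustrative assumptions."""
    qp, kp, vp = q[perm], k[perm], v[perm]
    n, d = qp.shape
    scores = qp @ kp.T / np.sqrt(d)
    # Block-diagonal mask: each query attends only within its block.
    blocks = np.arange(n) // block
    mask = blocks[:, None] == blocks[None, :]
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the unmasked entries.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ vp
    # Restore the original token order.
    inv = np.argsort(perm)
    return out[inv]

# Example: 8 tokens from a 2-frame x 4-position video, stored frame by
# frame; the permutation pairs each spatial position across both frames,
# so temporally-local tokens land in the same attention block.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
perm = np.array([0, 4, 1, 5, 2, 6, 3, 7])
out = permuted_block_sparse_attention(q, k, v, perm, block=2)
print(out.shape)  # (8, 16)
```

With a block-diagonal mask, only `n * block` score entries are computed per head instead of `n * n`, which is what makes the permuted layout amenable to hardware-efficient block-sparse kernels.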
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 15710