Abstract: Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 950 seconds of total inference time. This paper introduces Sliding Tile Attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over local spatio-temporal regions, STA eliminates redundancy in full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while remaining hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU -- 7.17× faster than prior methods. On the leading video DiT model, Hunyuan, it accelerates attention by 1.6–10× over FlashAttention-3, yielding a 1.36–3.53× end-to-end speedup with no or minimal quality loss.
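The abstract's key mechanism is that each query tile attends only to key tiles inside a local 3D window, rather than sliding token-by-token. As a rough illustration, the PyTorch sketch below builds the tile-level visibility mask this implies. The tile grid, window size, and tile shape are hypothetical placeholders, and the actual STA kernel would not materialize such a mask -- it skips non-attended tiles directly inside a fused, FlashAttention-style kernel.

```python
# Minimal sketch of the tile-level visibility mask implied by STA.
# Sizes are made up for illustration; not the paper's kernel.
import torch

def sta_tile_mask(grid=(6, 8, 8), window=(3, 3, 3)):
    """grid: number of tiles along (T, H, W); window: window size in tiles."""
    t, h, w = grid
    coords = torch.stack(torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w),
        indexing="ij"), dim=-1).reshape(-1, 3)           # (num_tiles, 3)
    grid_t, win_t = torch.tensor(grid), torch.tensor(window)
    # Window start per query tile, clamped so border tiles keep a full-size window.
    start = torch.maximum(coords - win_t // 2, torch.zeros_like(coords))
    start = torch.minimum(start, torch.clamp(grid_t - win_t, min=0))
    end = start + win_t
    # Key tile k is visible to query tile q iff it lies inside q's window.
    k = coords.unsqueeze(0)                              # (1, num_tiles, 3)
    visible = (k >= start.unsqueeze(1)) & (k < end.unsqueeze(1))
    return visible.all(dim=-1)                           # (num_tiles, num_tiles) bool

# Expanding to a token-level mask assumes tokens are flattened tile-by-tile,
# i.e. all tokens of one tile are contiguous; the (2, 4, 4) tile shape is hypothetical.
tokens_per_tile = 2 * 4 * 4
dense_mask = torch.kron(sta_tile_mask().int(),
                        torch.ones(tokens_per_tile, tokens_per_tile,
                                   dtype=torch.int)).bool()
```

Because visibility is decided per tile rather than per token, the resulting mask is block-structured, which is what lets an attention kernel process whole tiles contiguously instead of handling ragged token-wise windows.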
Lay Summary: Creating high-quality videos using AI models requires a huge amount of computation. For example, generating just a 5-second high-resolution video with current methods can take nearly 16 minutes, even on powerful hardware. The main bottleneck is a process called "attention," which helps the AI understand relationships between different parts of the video but is extremely slow.
In this work, we introduce a new technique called Sliding Tile Attention that makes this process dramatically faster. Instead of treating every part of the video equally, our method takes advantage of a simple observation: nearby frames and pixels often contain similar information. By focusing only on these local areas and processing them in larger chunks (or "tiles"), we eliminate a lot of unnecessary work.
Our method works right out of the box with existing video AI models and cuts the video generation time nearly in half—without hurting quality. When further fine-tuned, it can make video generation up to 3.5× faster, opening the door to faster and more practical AI-powered video tools.
Link To Code: https://github.com/hao-ai-lab/FastVideo
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion models, attention, machine learning systems
Submission Number: 1666