Keywords: Video generation, diffusion transformers, sparse attention, adaptive computation
TL;DR: NABLA, an adaptive block-level sparse attention mechanism that accelerates video diffusion transformers by up to 2.7× with no loss in generation quality
Abstract: Full self‑attention in video diffusion transformers scales quadratically with the spatio‑temporal token count, making the processing of high‑resolution clips prohibitively slow and memory‑heavy. We introduce NABLA, a Neighborhood‑Adaptive Block‑Level Attention mechanism that builds a per‑head sparse mask in three steps: (i) average‑pool queries and keys into $N\times N$ blocks, (ii) keep the highest‑probability blocks via a cumulative‑density threshold, and (iii) optionally union the result with Sliding‑Tile Attention (STA) to suppress border artefacts. NABLA drops straight into PyTorch's FlexAttention with no custom kernels or extra losses. On the Wan 2.1 14B text‑to‑video model at 720p, NABLA accelerates training and inference by up to $2.7\times$ while matching CLIP ($42.06\rightarrow42.08$), VBench ($83.16\rightarrow 83.17$) and FVD ($68.9\rightarrow 67.5$) scores. During pre‑training of a 2B DiT at $512^2$, iteration time falls from 10.9s to 7.5s ($1.46\times$) with lower validation loss. A link to the code and model weights will be published in the camera-ready version of the paper.
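To make the three-step mask construction concrete, the following is a minimal PyTorch sketch of a NABLA-style block-level mask, not the authors' implementation: the function name `nabla_block_mask`, the `block_size` and `threshold` parameters, and the simple diagonal band used as a stand-in for the STA union are all illustrative assumptions.

```python
import torch


def nabla_block_mask(q, k, block_size=64, threshold=0.9):
    """Hypothetical sketch of a NABLA-style block-level attention mask.

    q, k: [batch, heads, seq_len, head_dim]; seq_len is assumed to be
    divisible by block_size. Returns a boolean [batch, heads, nb, nb] mask
    where True marks query-block / key-block pairs to keep.
    """
    B, H, L, D = q.shape
    nb = L // block_size

    # (i) average-pool queries and keys into blocks along the sequence axis
    q_blk = q.reshape(B, H, nb, block_size, D).mean(dim=3)
    k_blk = k.reshape(B, H, nb, block_size, D).mean(dim=3)

    # block-level attention probabilities
    scores = q_blk @ k_blk.transpose(-1, -2) / D ** 0.5
    probs = scores.softmax(dim=-1)  # [B, H, nb, nb]

    # (ii) per query-block row, keep the highest-probability key blocks until
    # the cumulative probability mass reaches the threshold (CDF cut-off)
    sorted_p, order = probs.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    keep_sorted = (cdf - sorted_p) < threshold  # the top block is always kept
    mask = torch.zeros_like(probs).scatter(
        -1, order, keep_sorted.to(probs.dtype)
    ).bool()

    # (iii) optionally union with a local window pattern to suppress border
    # artefacts (a simple diagonal band stands in for Sliding-Tile Attention)
    band = (torch.arange(nb)[:, None] - torch.arange(nb)[None, :]).abs() <= 1
    return mask | band.to(mask.device)


# Illustrative usage on random tensors
q = torch.randn(1, 8, 4096, 64)
k = torch.randn(1, 8, 4096, 64)
block_mask = nabla_block_mask(q, k)  # [1, 8, 64, 64] boolean block mask
```

In practice, a boolean block mask of this kind could be exposed to FlexAttention through its block-mask interface, which is how the abstract describes NABLA integrating without custom kernels; the exact wiring depends on the PyTorch version and is not shown here.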
Primary Area: generative models
Submission Number: 16730