NABLA: Neighborhood-Adaptive Block-Level Attention for Efficient Video Diffusion Transformers

ICLR 2026 Conference Submission 16730 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Video generation, diffusion transformers, sparse attention, adaptive computation
TL;DR: A novel sparse attention approach for transformer-based architectures in video generation tasks
Abstract: Full self‑attention in video diffusion transformers scales quadratically with the spatio‑temporal token count, making the processing of high‑resolution clips prohibitively slow and memory‑heavy. We introduce NABLA, a Neighborhood‑Adaptive Block‑Level Attention mechanism that builds a per‑head sparse mask in three steps: (i) average‑pool queries and keys into $N\times N$ blocks, (ii) keep the highest‑probability blocks via a cumulative‑density threshold, and (iii) optionally union the result with Sliding‑Tile Attention (STA) to suppress border artefacts. NABLA drops straight into PyTorch's FlexAttention with no custom kernels or extra losses. On the Wan 2.1 14B text‑to‑video model at 720p, NABLA accelerates training and inference by up to $2.7\times$ while matching CLIP ($42.06\rightarrow 42.08$), VBench ($83.16\rightarrow 83.17$) and FVD ($68.9\rightarrow 67.5$) scores. During pre‑training of a 2B DiT at $512^2$, iteration time falls from 10.9s to 7.5s ($1.46\times$) with lower validation loss. A link to the code and model weights will be published in the camera-ready version of the paper.
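To make the three-step mask construction in the abstract concrete, below is a minimal PyTorch sketch of block-pooled attention scores followed by a cumulative-probability (top-p style) block selection. The function name `nabla_block_mask`, the parameter names `block_size` and `cdf_threshold`, and the exact ordering of operations are illustrative assumptions, not the authors' released implementation; converting the resulting boolean block mask into a FlexAttention `mask_mod`/`BlockMask` is left out.

```python
import torch

def nabla_block_mask(q, k, block_size=64, cdf_threshold=0.9):
    """Sketch of a NABLA-style block-level mask (names are hypothetical).

    q, k: [batch, heads, seq_len, head_dim]; seq_len is assumed to be a
    multiple of block_size. Returns a boolean [batch, heads, nb, nb] mask
    where True means the query block attends to the key block.
    """
    b, h, s, d = q.shape
    nb = s // block_size

    # (i) average-pool queries and keys into blocks of block_size tokens
    q_blk = q.reshape(b, h, nb, block_size, d).mean(dim=3)
    k_blk = k.reshape(b, h, nb, block_size, d).mean(dim=3)

    # block-level attention probabilities per head
    scores = torch.softmax(q_blk @ k_blk.transpose(-1, -2) / d**0.5, dim=-1)

    # (ii) keep, per query block, the smallest set of key blocks whose
    # cumulative probability mass reaches cdf_threshold
    sorted_p, order = scores.sort(dim=-1, descending=True)
    keep_sorted = (sorted_p.cumsum(dim=-1) - sorted_p < cdf_threshold).to(scores.dtype)
    mask = torch.zeros_like(scores).scatter(-1, order, keep_sorted).bool()

    # (iii) optionally union with a local sliding-window block pattern
    # (an STA-like diagonal band), e.g.:
    #   idx = torch.arange(nb, device=mask.device)
    #   mask |= (idx[:, None] - idx[None, :]).abs() <= window_blocks
    return mask
```

The per-head mask could then be wrapped in a FlexAttention block mask so that masked-out blocks are skipped entirely, which is where the claimed speedup would come from; the thresholding step adapts the sparsity pattern to each head and each input rather than fixing it in advance.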
Primary Area: generative models
Submission Number: 16730