Keywords: attention, block attention, video generation, dynamic algorithm, theory, data structure
Abstract: Recent progress in video modeling has been largely driven by Transformer architectures, which model dependencies across spatial patches and temporal frames. However, compared to text or image modeling, video modeling involves input sequences that are orders of magnitude longer, which makes the attention mechanism the primary computational bottleneck. The naive method flattens $f$ frames of $n$ tokens each into a sequence of length $N = nf$, incurring a total attention cost of $O(n^2 f^2)$.
Prior work (e.g., radial/axial variants) attains subquadratic time only when either the spatial or temporal dimension is small. We present a dynamic algorithm that computes block attention in $O(\mathcal{T}_{\mathrm{mat}}(n, n, n^a) \cdot \frac{f}{n^{a}})$ amortized running time, where $a \in [0, 1)$.
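For reference, a minimal sketch of the naive baseline described in the abstract, which flattens all $f$ frames into one sequence and runs dense attention; the function name and toy dimensions are illustrative assumptions, not part of the paper, and this is not the proposed dynamic algorithm.

```python
# Illustrative sketch of the naive baseline: flatten f frames of n tokens
# into one sequence of length N = n * f and run full softmax attention.
# The score matrix alone has N^2 = n^2 * f^2 entries, matching the
# O(n^2 f^2) cost stated in the abstract.
import numpy as np

def naive_video_attention(Q, K, V):
    """Dense attention over the flattened (n * f, d) token sequence."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d) output

# Toy example (hypothetical sizes): f = 4 frames, n = 16 tokens/frame, d = 8.
f, n, d = 4, 16, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((n * f, d))               # flattened video tokens
out = naive_video_attention(X, X, X)
print(out.shape)                                  # (64, 8); scores are 64 x 64
```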
Primary Area: generative models
Submission Number: 15689