Keywords: Efficient Attention, Sparse Attention, Video Generation
TL;DR: We propose Unified Tail Aggregation (UTA) to resolve the structural errors of Top-$K$ sparse attention in Diffusion Transformers by aggregating the discarded attention tail, achieving up to a 97.4% MSE reduction at 50% sparsity.
Abstract: While Diffusion Transformers (DiTs) deliver remarkable visual quality in video generation, the massive computational overhead of 3D spatio-temporal attention limits their scalability. To evaluate and optimize sparse attention mechanisms, existing studies predominantly rely on the Top-$K$ Oracle Policy. However, this policy truncates rigidly: it discards the continuous tail of the attention distribution, introducing structural errors that degrade temporal consistency over the iterative diffusion process. To address this flaw, we provide an oracle analysis of these distributional shifts and introduce a novel Low-Delta Oracle Policy. Building on a proof that sparse attention incurs zero error when the grouped attention scores are identical, our approach prioritizes the structural integrity of the entire attention distribution. As a correction mechanism, we propose Unified Tail Aggregation (UTA). By aggregating logits whose score variance is bounded by a marginal delta, UTA adds a single aggregated logit that restores the attention distribution. Extensive empirical evaluations demonstrate that our approach significantly outperforms the Top-$K$ oracle, achieving up to a 97.4% reduction in mean squared error (MSE) at a 50% sparsity level. By establishing a tighter theoretical upper bound, this work provides a rigorous foundation for evaluating and stabilizing future sparse attention systems.
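The zero-error claim in the abstract can be illustrated with a small NumPy sketch for a single query. This is not the authors' implementation: the function names, the choice of mean tail logit plus log of the tail size as the aggregated logit, and the mean tail value as its value vector are all assumptions made for illustration, and the delta-based grouping criterion of the Low-Delta Oracle Policy is not modeled. The sketch only shows that, when every discarded logit is identical, replacing the tail by one aggregated entry reproduces dense attention, whereas plain Top-$K$ truncation does not.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dense_attention(logits, values):
    # Reference output: softmax-weighted sum over all keys.
    return softmax(logits) @ values

def topk_attention(logits, values, k):
    # Top-K truncation: keep the k largest logits, drop the tail entirely.
    # Assumes 1 <= k <= len(logits).
    head = np.argsort(logits)[-k:]
    return softmax(logits[head]) @ values[head]

def tail_aggregated_attention(logits, values, k):
    # Hypothetical tail aggregation (an assumption, not the paper's exact rule):
    # keep the k largest logits and summarize the rest as ONE surrogate key.
    # Assumes 1 <= k <= len(logits).
    order = np.argsort(logits)
    head, tail = order[-k:], order[:-k]
    if tail.size == 0:
        return dense_attention(logits, values)
    # For m identical tail logits s: sum_i exp(s) * v_i = exp(s + log m) * mean(v),
    # so the surrogate uses logit s + log m and the mean tail value.
    agg_logit = logits[tail].mean() + np.log(tail.size)
    agg_value = values[tail].mean(axis=0)
    new_logits = np.append(logits[head], agg_logit)
    new_values = np.vstack([values[head], agg_value])
    return softmax(new_logits) @ new_values

# Toy example: two dominant keys and a flat tail of four identical logits.
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, -1.0, -1.0, -1.0, -1.0])
values = rng.normal(size=(6, 4))

dense = dense_attention(logits, values)
err_topk = np.mean((topk_attention(logits, values, 2) - dense) ** 2)
err_agg = np.mean((tail_aggregated_attention(logits, values, 2) - dense) ** 2)
print("Top-K MSE:", err_topk)
print("Tail-aggregated MSE:", err_agg)
```

With a flat tail, the aggregated variant matches dense attention up to floating-point rounding, while Top-$K$ leaves a nonzero error. When the tail logits are not identical, this mean-based surrogate is only an approximation, which is presumably where a bound on score variance such as the abstract's delta becomes relevant.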
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 88