DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention

ICLR 2026 Conference Submission19434 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Distributed System, Diffusion, Inference, Sparsity
TL;DR: DSA introduces a training-free sparse attention with distributed inference for diffusion-based video generation, cutting redundant computation and achieving up to 10.55× faster inference.
Abstract: Diffusion Transformer models have driven rapid advances in video generation, achieving state-of-the-art quality and flexibility. However, their attention mechanism remains a major performance bottleneck, as its dense computation scales quadratically with the sequence length. To overcome this limitation and reduce generation latency, we propose DSA, a novel attention mechanism that integrates sparse attention with distributed inference for diffusion-based video generation. By leveraging carefully designed parallelism strategies and scheduling, DSA significantly reduces redundant computation while preserving global context. Extensive experiments on benchmark datasets demonstrate that, when deployed on 8 GPUs, DSA achieves up to a 1.43× inference speedup over the existing distributed method and runs 10.79× faster than single-GPU inference.
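The abstract does not include implementation details, but the core idea it names, replacing dense attention with a sparse pattern so each query attends to only a subset of keys, can be illustrated with a generic block-sparse attention sketch. This is a minimal, hypothetical example: the block size and the `keep` sparsity pattern are illustrative assumptions, not DSA's actual design.

```python
import numpy as np

def block_sparse_attention(q, k, v, block, keep):
    """Attention where each query block attends only to selected key blocks.

    q, k, v: arrays of shape (seq, dim).
    block:   block size (assumed to divide seq evenly).
    keep:    dict mapping each query-block index to the list of key-block
             indices it attends to (the sparsity pattern; hypothetical here).
    """
    seq, dim = q.shape
    n_blocks = seq // block
    out = np.zeros_like(v)
    for i in range(n_blocks):
        qi = q[i * block:(i + 1) * block]  # one query block
        # Gather only the key/value rows this query block is allowed to see.
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in keep[i]])
        ki, vi = k[cols], v[cols]
        scores = qi @ ki.T / np.sqrt(dim)  # scaled dot-product scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over kept keys
        out[i * block:(i + 1) * block] = weights @ vi
    return out
```

With a full `keep` pattern this reduces to ordinary dense attention; dropping key blocks from `keep` is what cuts the quadratic cost, at the price of each query seeing fewer keys.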
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 19434