Fast Autoregressive Video Generation with Diagonal Decoding

18 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Autoregressive Video Generation, Acceleration
TL;DR: Diagonal Decoding introduces a training-free method to accelerate autoregressive video generation models, achieving up to 10x speedup.
Abstract: Autoregressive Transformers have demonstrated impressive performance in generative modeling. However, their sequential, token-by-token decoding becomes a severe bottleneck for video generation, which may require generating tens of thousands of tokens sequentially. In this paper, we introduce Diagonal Decoding (DiagD), a training-free inference acceleration algorithm that exploits spatiotemporal correlations to speed up autoregressively pre-trained models. DiagD generates tokens simultaneously along diagonal trajectories in the spatial-temporal token grid, enabling parallel decoding within frames and partial overlap across successive frames. The proposed algorithm is versatile, applies to various generative models and tasks, and offers adjustable trade-offs between speed and visual quality. Furthermore, we propose a cost-effective fine-tuning strategy that aligns the model's attention patterns with the new decoding order, demonstrating the potential of training with DiagD. Experiments on several autoregressive video generation models and datasets demonstrate that DiagD achieves up to **10x** speed-up over naive sequential decoding while preserving comparable visual fidelity.
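As a rough illustration of the decoding order the abstract describes, the sketch below enumerates anti-diagonals of each frame's token grid: tokens on the same diagonal are emitted in one parallel step, and successive frames start with a small offset so their decoding partially overlaps. The grid size, the per-frame offset, and the function name are illustrative assumptions, not the paper's implementation.

```python
def diagonal_schedule(height, width, num_frames, frame_offset=1):
    """Return decoding steps; each step is a list of (frame, row, col)
    token positions that could be generated in parallel.

    Hypothetical sketch of diagonal decoding: tokens with the same
    anti-diagonal index (row + col) within a frame share a step, and
    each later frame is shifted by `frame_offset` steps so frames
    partially overlap during decoding.
    """
    steps = {}
    for f in range(num_frames):
        for i in range(height):
            for j in range(width):
                step = i + j + f * frame_offset
                steps.setdefault(step, []).append((f, i, j))
    return [steps[s] for s in sorted(steps)]

# Naive sequential decoding of two 4x4 frames takes 32 steps;
# the diagonal schedule needs only 8 parallel steps here.
schedule = diagonal_schedule(4, 4, num_frames=2)
print(len(schedule))
```

The speed-up comes from the step count growing roughly with grid side length plus frame count (times the offset), rather than with the total number of tokens.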
Supplementary Material: zip
Primary Area: generative models
Submission Number: 12498