Spiral Evolution of Visual World Model: Reclaiming Autoregression from the Diffusion Era

15 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 Position Paper Track · CC BY 4.0
Keywords: Interactive World Model; Autoregressive Prediction; Video Generation; Multimodal-Controlled Video Generation
TL;DR: We argue for a return to autoregressive models for video generation, highlighting their advantages in real-time, interactive world modeling.
Abstract: Recent advances in video generation have been dominated by diffusion-based models, which produce high-quality, prompt-faithful sequences through holistic denoising. While this paradigm has achieved striking visual fidelity, it falls short for real-time, interactive applications that require frame-level responsiveness and causal coherence—cornerstones of practical world modeling. In this position paper, we advocate for a strategic return to autoregressive generation as the foundational architecture for building interactive world simulators. We argue that beyond offering faster inference, autoregressive models bring critical structural advantages: they naturally support predictive compression, enable causal disentanglement, and offer a more responsive mechanism for integrating control signals in dynamic settings. Unlike language-conditioned diffusion models, autoregression flexibly accommodates frame-wise control inputs such as camera motion and joint actions, making it ideally suited for agent-centric simulation. We further highlight emerging techniques and promising directions—including selective denoising, adaptive resolution, and postdictive coding—that address historical limitations of autoregression and unlock new levels of interactivity. We contend that embracing autoregression will be essential for developing practical, controllable, and truly intelligent world models.
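To make the contrast concrete, the frame-wise generation loop the abstract advocates can be sketched as follows. This is a minimal toy illustration, not the authors' method: `predict_next_frame` stands in for a learned one-step world model, and the simple decay dynamics are an assumption chosen only to make the example runnable. The key structural point it shows is that each frame is emitted immediately and conditioned on a per-frame control input, so an agent can change its actions mid-rollout, whereas holistic denoising commits to the whole clip at once.

```python
import numpy as np

def predict_next_frame(history, action, rng):
    """Hypothetical one-step world model: maps past frames and a
    per-frame control signal (e.g. camera motion or a joint action)
    to the next frame. A real model would be a learned network."""
    last = history[-1]
    # Toy linear dynamics with a small noise term (assumed for illustration).
    return 0.9 * last + 0.1 * action + 0.01 * rng.standard_normal(last.shape)

def autoregressive_rollout(initial_frame, actions, rng):
    """Causal, frame-by-frame generation: frame t depends only on
    frames < t plus the control input for step t, so new control
    signals can be injected at every step of the rollout."""
    frames = [initial_frame]
    for action in actions:
        frames.append(predict_next_frame(frames, action, rng))
    return np.stack(frames)

rng = np.random.default_rng(0)
frame0 = np.zeros((4, 4))                      # tiny stand-in "image"
actions = [np.full((4, 4), a) for a in (1.0, 1.0, -1.0)]
video = autoregressive_rollout(frame0, actions, rng)
print(video.shape)  # (4, 4, 4): initial frame plus one frame per action
```

Because the action sequence is consumed one step at a time, this loop could equally be driven by a live controller rather than a fixed list, which is the interactivity argument the paper makes.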
Submission Number: 293