Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. By leveraging the next-token prediction strategy, GPT-style models can forecast future events from past data. Some research aims to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting the ego state. However, prior work tends to produce unsatisfactory results, since the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent capability to model the spatial and temporal dynamics necessary for video generation. In this paper, we present DrivingWorld, a video-based world model for autonomous driving built on a new GPT structure with a spatial-temporal design. The key idea is to disentangle temporal and spatial information during generation. Specifically, we first propose a next-frame prediction strategy to model temporal coherence between consecutive frames, and then apply a next-token prediction strategy to capture spatial information within each frame. With this hybrid design, our model is capable of producing high-fidelity, temporally consistent video clips of long duration. Experiments show that, compared to prior work, our method achieves higher visual quality and more accurate, controllable future video generation.
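
The following is a minimal PyTorch sketch of the disentangled spatial-temporal idea described above. It is not the authors' implementation: the module names, tensor shapes (batch B, frames T, tokens per frame N, channels D), and the use of standard multi-head attention are illustrative assumptions. It only shows how a single block could apply causal attention over the time axis (next-frame stage) followed by causal attention over tokens within each frame (next-token stage).

```python
import torch
import torch.nn as nn


def causal_mask(n: int, device) -> torch.Tensor:
    # Boolean upper-triangular mask: position i may only attend to positions <= i.
    return torch.triu(torch.ones(n, n, device=device, dtype=torch.bool), diagonal=1)


class SpatialTemporalBlock(nn.Module):
    """Hypothetical hybrid block: causal attention over time, then causal
    attention over the visual tokens inside each frame."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- T frames, each tokenized into N visual tokens.
        B, T, N, D = x.shape

        # 1) Temporal ("next-frame") stage: each spatial position attends only
        #    to the same position in previous frames (causal over the T axis).
        t_in = self.norm1(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.temporal_attn(t_in, t_in, t_in,
                                      attn_mask=causal_mask(T, x.device))
        x = x + t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # 2) Spatial ("next-token") stage: tokens within a frame attend causally
        #    to earlier tokens of the same frame.
        s_in = self.norm2(x).reshape(B * T, N, D)
        s_out, _ = self.spatial_attn(s_in, s_in, s_in,
                                     attn_mask=causal_mask(N, x.device))
        x = x + s_out.reshape(B, T, N, D)

        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 4, 16, 256)          # 2 clips, 4 frames, 16 tokens/frame
    print(SpatialTemporalBlock()(tokens).shape)  # torch.Size([2, 4, 16, 256])
```

In this sketch, the temporal stage keeps the number of attended positions per query bounded by the clip length T rather than T x N, which is one way such a factorized design can stay tractable for long rollouts; the paper's actual architecture and masking scheme may differ.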