Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. By leveraging the next-token prediction strategy, GPT-style models can forecast future events from past data. Some research aims to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting the ego state. However, prior work tends to produce unsatisfactory results, since the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent capability to model the spatial and temporal dynamics necessary for video generation. In this paper, we present DrivingWorld, a video-based world model for autonomous driving built on a new GPT structure with a spatial-temporal design. The key idea is to disentangle temporal and spatial information during generation. Specifically, we first propose a next-frame prediction strategy to model temporal coherence between consecutive frames, and then apply a next-token prediction strategy to capture spatial information within each frame. With this hybrid design, our model is capable of producing high-fidelity, temporally consistent video clips of long duration. Experiments show that, compared to prior work, our method achieves higher visual quality and more accurate, controllable future video generation.
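
The following is a minimal PyTorch sketch of the disentangled spatial-temporal idea described above. It is not the authors' implementation: the module names, tensor shapes (batch B, frames T, tokens per frame N, channels D), and the use of standard multi-head attention are illustrative assumptions. It only shows how a single block could apply causal attention over the time axis (next-frame stage) followed by causal attention over tokens within each frame (next-token stage).

```python
import torch
import torch.nn as nn


def causal_mask(n: int, device) -> torch.Tensor:
    # Boolean upper-triangular mask: position i may only attend to positions <= i.
    return torch.triu(torch.ones(n, n, device=device, dtype=torch.bool), diagonal=1)


class SpatialTemporalBlock(nn.Module):
    """Hypothetical hybrid block: causal attention over time, then causal
    attention over the visual tokens inside each frame."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- T frames, each tokenized into N visual tokens.
        B, T, N, D = x.shape

        # 1) Temporal ("next-frame") stage: each spatial position attends only
        #    to the same position in previous frames (causal over the T axis).
        t_in = self.norm1(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_out, _ = self.temporal_attn(t_in, t_in, t_in,
                                      attn_mask=causal_mask(T, x.device))
        x = x + t_out.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # 2) Spatial ("next-token") stage: tokens within a frame attend causally
        #    to earlier tokens of the same frame.
        s_in = self.norm2(x).reshape(B * T, N, D)
        s_out, _ = self.spatial_attn(s_in, s_in, s_in,
                                     attn_mask=causal_mask(N, x.device))
        x = x + s_out.reshape(B, T, N, D)

        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 4, 16, 256)          # 2 clips, 4 frames, 16 tokens/frame
    print(SpatialTemporalBlock()(tokens).shape)  # torch.Size([2, 4, 16, 256])
```

In this sketch, the temporal stage keeps the number of attended positions per query bounded by the clip length T rather than T x N, which is one way such a factorized design can stay tractable for long rollouts; the paper's actual architecture and masking scheme may differ.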