iVideoGPT: Interactive VideoGPTs are Scalable World Models

Jialong Wu; Shaofeng Yin; Ningya Feng; Xu He; Dong Li; Jianye HAO; Mingsheng Long

Published: 25 Sept 2024, Last Modified: 14 Jan 2025

Venue: NeurIPS 2024 poster

License: CC BY 4.0

Keywords: world model, model-based reinforcement learning, video prediction, visual planning

TL;DR: We propose iVideoGPT, an autoregressive transformer architecture for scalable world models, pre-train it on millions of trajectories and adapt it to a wide range of tasks, including video prediction, visual planning, and model-based RL.

Abstract: World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
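As a rough illustration of the idea in the abstract (observations, actions, and rewards flattened into one token sequence and rolled out by next-token prediction), here is a minimal sketch. It is not the authors' implementation: the vocabulary sizes, the per-step ordering (observation tokens, then action, then reward), the treatment of actions and rewards as discrete tokens, and the names `Step`, `flatten_trajectory`, `rollout`, and `predict_next` are all assumptions made for this example. The compressive tokenizer and the transformer are replaced by placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical vocabulary layout (assumed for illustration only):
# visual tokens first, then discretized actions, then discretized rewards.
N_VISUAL, N_ACTION, N_REWARD = 1024, 16, 8
ACTION_OFFSET = N_VISUAL
REWARD_OFFSET = N_VISUAL + N_ACTION


@dataclass
class Step:
    obs_tokens: List[int]  # discrete codes for one frame (tokenizer output, assumed given)
    action: int            # discretized action index in [0, N_ACTION)
    reward_bin: int        # discretized reward index in [0, N_REWARD)


def flatten_trajectory(steps: Sequence[Step]) -> List[int]:
    """Interleave each step as [obs tokens..., action token, reward token]."""
    seq: List[int] = []
    for s in steps:
        seq.extend(s.obs_tokens)
        seq.append(ACTION_OFFSET + s.action)
        seq.append(REWARD_OFFSET + s.reward_bin)
    return seq


def rollout(
    predict_next: Callable[[Sequence[int]], int],
    context: Sequence[int],
    actions: Sequence[int],
    tokens_per_frame: int,
) -> List[int]:
    """Imagine future steps: for each agent action, append the action token,
    then let the model emit one reward token and one frame of visual tokens."""
    seq = list(context)
    for a in actions:
        seq.append(ACTION_OFFSET + a)         # action supplied by the agent
        for _ in range(1 + tokens_per_frame):  # 1 reward token + next frame
            seq.append(predict_next(seq))
    return seq


if __name__ == "__main__":
    steps = [Step([1, 2, 3], action=2, reward_bin=0)]
    print(flatten_trajectory(steps))
    # Placeholder model that always predicts token 0; a real model would be
    # an autoregressive transformer over the full vocabulary.
    print(rollout(lambda seq: 0, [1, 2, 3], actions=[2, 1], tokens_per_frame=3))
```

In this sketch the agent interacts with the model one action at a time, which is the property the abstract calls interactivity: the sequence is extended step by step rather than generated as a whole video up front.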

Primary Area: Reinforcement learning

Submission Number: 1179
