A Universal World Model Learned from Large Scale and Diverse Videos

Published: 07 Nov 2023, Last Modified: 04 Dec 2023, FMDM@NeurIPS 2023
Keywords: World model, Large scale video pretrain, Generalization
TL;DR: We propose a generalizable world model pre-trained on a large-scale, diverse video dataset, then fine-tune it to obtain an accurate dynamics function in a sample-efficient manner.
Abstract: World models play a crucial role in model-based reinforcement learning (RL) by providing predictive representations of an agent in an environment, enabling the agent to reason about the future and make more informed decisions. However, two main problems still limit the applications of world models. First, current methods typically train world models using only massive domain-specific data, making it challenging to generalize to unseen scenarios or adapt to changes in the environment. Second, it is difficult to define actions when world models are trained on in-the-wild videos. In this work, we tackle both problems by learning a general-purpose world model from a diverse, large-scale real-world video dataset with extracted latent actions. Specifically, our approach leverages a pre-trained vision encoder to project the images of two adjacent frames into states; it then extracts latent actions into a low-dimensional space via vector quantization; finally, a dynamics function is learned using the latent actions. Results show that the proposed generic world model can successfully extract latent actions from arbitrary neighboring frames when tested on in-the-wild video datasets. Furthermore, fine-tuning on only a small amount of in-domain data can significantly improve the accuracy of the generic world model when adapting to unseen environments.
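The three-stage pipeline described in the abstract (encode adjacent frames into states, vector-quantize the transition into a discrete latent action, then learn a dynamics function over those actions) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the encoder, codebook, and dynamics function are stand-in random linear maps (names `encode`, `extract_latent_action`, and `dynamics` are hypothetical), whereas the paper uses a pre-trained vision encoder and trains these components.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, NUM_CODES, FRAME_DIM = 8, 4, 16, 64

# Stand-in for a pre-trained vision encoder: a fixed random projection
# from a flattened frame to a state vector (the paper uses a real encoder).
W_enc = rng.normal(size=(FRAME_DIM, STATE_DIM))

def encode(frame):
    """Project a flattened frame into the state space."""
    return frame @ W_enc

# VQ codebook of discrete latent actions (randomly initialized here;
# in the paper it is learned jointly with the other components).
codebook = rng.normal(size=(NUM_CODES, ACTION_DIM))

# Projection from the state transition into the latent-action space.
W_act = rng.normal(size=(STATE_DIM, ACTION_DIM))

def extract_latent_action(s_t, s_next):
    """Vector-quantize the transition (s_t -> s_next) to a codebook entry."""
    pre = (s_next - s_t) @ W_act                      # continuous action embedding
    idx = np.argmin(((codebook - pre) ** 2).sum(-1))  # nearest-neighbor lookup
    return idx, codebook[idx]

# Dynamics function: predicts the next state from (state, latent action).
W_dyn = rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM))

def dynamics(s_t, a_t):
    return np.concatenate([s_t, a_t]) @ W_dyn

# Two adjacent video frames (flattened 8x8 images).
f_t, f_next = rng.normal(size=FRAME_DIM), rng.normal(size=FRAME_DIM)
s_t, s_next = encode(f_t), encode(f_next)
idx, a_t = extract_latent_action(s_t, s_next)
s_pred = dynamics(s_t, a_t)
print(idx, s_pred.shape)
```

In training, the prediction error between `s_pred` and `s_next` would drive updates to the action projection, codebook, and dynamics function; fine-tuning on a small amount of in-domain data would then adapt these same components to a new environment.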
Submission Number: 13