AVID: Adapting Video Diffusion Models to World Models

Published: 09 May 2025, Last Modified: 09 May 2025
Venue: RLC 2025
License: CC BY 4.0
Keywords: world models, diffusion models, model-based reinforcement learning, black-box adaptation
TL;DR: Adapting pretrained video models to world models (suitable for model-based RL) without finetuning.
Abstract: Reinforcement learning (RL) is highly effective in domains that can be easily simulated. However, in problems such as robotic manipulation, accurate simulation is challenging and gathering large amounts of real-world data is impractical. A potential solution lies in leveraging widely available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to generate synthetic data to optimize decision-making via RL. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed-source, which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models into action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it generally outperforms baselines for diffusion-model adaptation on video and image metrics. AVID demonstrates that pretrained video models have the potential to be powerful tools for generating synthetic data for RL agents. In future work, we plan to investigate how this improved data-generation accuracy translates to model-based RL performance.
Submission Number: 64
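
The core mechanism described in the abstract, a learned mask that blends the frozen pretrained model's output with an action-conditioned adapter's output, can be illustrated with a short sketch. The PyTorch code below is a hypothetical illustration, not the authors' implementation; all names (ActionAdapter, adapted_noise_prediction), layer choices, and dimensions are assumptions.

```python
# Hypothetical sketch of AVID-style masked adaptation (all names and
# architecture choices are assumptions, not the authors' code).
import torch
import torch.nn as nn


class ActionAdapter(nn.Module):
    """Small adapter that sees the noisy latent, the black-box model's noise
    prediction, and an action embedding, and outputs an adapted noise
    prediction plus a mask in [0, 1] saying where to trust the adapter."""

    def __init__(self, channels: int = 4, action_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            # Output: `channels` adapted-noise channels plus 1 mask logit.
            nn.Conv2d(hidden, channels + 1, kernel_size=3, padding=1),
        )

    def forward(self, x_t, eps_pretrained, action):
        # Broadcast the action embedding across the spatial dimensions.
        a = self.action_proj(action)[:, :, None, None]
        a = a.expand(-1, -1, x_t.shape[-2], x_t.shape[-1])
        out = self.net(torch.cat([x_t, eps_pretrained, a], dim=1))
        eps_adapter, mask = out[:, :-1], torch.sigmoid(out[:, -1:])
        return eps_adapter, mask


def adapted_noise_prediction(frozen_model, adapter, x_t, t, action):
    """One denoising call: query the pretrained model as a black box, then
    blend its output with the adapter's output via the learned mask."""
    with torch.no_grad():  # pretrained parameters are inaccessible/frozen
        eps_pre = frozen_model(x_t, t)
    eps_ada, mask = adapter(x_t, eps_pre, action)
    return mask * eps_ada + (1.0 - mask) * eps_pre


if __name__ == "__main__":
    # Toy usage with a dummy stand-in for the closed-source video model.
    frozen = lambda x, t: torch.zeros_like(x)  # placeholder black box
    adapter = ActionAdapter()
    x_t = torch.randn(2, 4, 16, 16)  # noisy latent frames
    action = torch.randn(2, 8)       # action conditioning
    eps = adapted_noise_prediction(frozen, adapter, x_t, torch.zeros(2), action)
    print(eps.shape)  # torch.Size([2, 4, 16, 16])
```

Because the pretrained model is queried only as a black box inside torch.no_grad(), only the adapter's parameters receive gradients during training, consistent with the setting where the pretrained weights cannot be accessed or finetuned.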