Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu; Yanjiang Guo; Pengchao Wang; Xiaoyu Chen; Yen-Jen Wang; Jianke Zhang; Koushil Sreenath; Chaochao Lu; Jianyu Chen

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 spotlight poster, License: CC BY 4.0

TL;DR: We propose the Video Prediction Policy, a generalist robot policy conditioned on visual representations inside video diffusion models.

Abstract: Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and have shown a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict the future more precisely, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Videos are available at https://video-prediction-policy.github.io/

Lay Summary: In this work, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from video diffusion models. VPP implicitly learns inverse dynamics conditioned on these predictive representations, leading to consistent performance gains in both simulated and real-world environments. We also demonstrate the benefit of utilizing physical knowledge embedded in pre-trained video generation models and large-scale Internet manipulation datasets. Our results underscore the potential of video models in enabling physical intelligence and highlight their value in embodied robotic tasks.

Link To Code: https://github.com/roboterax/video-prediction-policy

Primary Area: Applications->Robotics

Keywords: Robot policy learning, diffusion model, inverse dynamics model

Submission Number: 10019
