Dyn-VPP: Video Prediction Policy Optimization for Improved Visual Dynamics

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
Abstract: Video action models are a promising foundation for Vision–Language–Action (VLA) models because they can learn rich visual dynamics directly from video. However, likelihood-oriented training of diffusion predictors emphasizes globally plausible futures and does not guarantee the precision-critical visual dynamics needed for manipulation, so small prediction errors can be amplified by downstream policies. We propose Dyn-VPP, a post-training framework that casts multi-step denoising as policy optimization and aligns predicted future latents with expert visual dynamics via a verifiable terminal reward, without modifying the architecture. This enables explicit optimization of dynamics signals that likelihood-only training does not capture. As a result, Dyn-VPP yields more accurate visual dynamics and improves downstream task execution. Experiments across diverse simulated and real-world manipulation settings show improved dynamics consistency and consistently higher task success.
Lay Summary: Videos are a valuable resource for teaching robotic vision systems to understand visual changes and perform real-world physical tasks. Current video prediction models can generate future scenes that look realistic overall, but they often make small, subtle mistakes in the precise motion that robotic manipulation depends on. These small errors compound during task execution and can eventually cause the task to fail. To address this, we developed a simple and effective post-training method named Dyn-VPP. The approach does not require any change to the original model structure. Instead, it improves the model's ability to predict visual changes by referencing accurate, expert-level visual motion, and it uses clear evaluation feedback to ensure the model learns precise and reliable motion patterns. With this design, our method corrects the inaccurate understanding of motion found in conventionally trained video models. It enables the model to predict more accurate and consistent visual scene changes, and it improves the task success rate of robotic systems across a variety of simulated and real-world manipulation tasks.
Primary Area: Applications->Robotics
Keywords: vision-language-action (VLA) models
Submission Number: 21340