Empowering World Models with Reflection for Embodied Video Prediction

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0.
Abstract: Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, which limits their ability to perform multi-step prediction or to handle out-of-distribution (OOD) scenarios. To address this challenge, we propose Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. RoG leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function together as a world model in embodied scenarios. To support RoG, we introduce the Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios using both in-domain and OOD datasets. Building on this foundation, we devise a world model, the Embodied Video Anticipator (EVA), which follows a multi-stage training paradigm to generate high-fidelity video frames and applies an autoregressive strategy to enable adaptive generalization to longer video sequences. Extensive experiments demonstrate the efficacy of EVA on various downstream tasks such as video generation and robotics, paving the way for large-scale pre-trained models in real-world video prediction applications. Video demos are available at https://sites.google.com/view/icml-eva.
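To make the abstract's autoregressive-with-reflection idea concrete, the sketch below shows one plausible shape of such a loop: a video generator proposes the next chunk of frames, a vision-language critic scores it against the task instruction and may refine the prompt, and accepted chunks condition the next prediction step. This is a minimal illustration under stated assumptions, not the paper's implementation; the VideoGenerator and VLMCritic classes, the ACCEPT_SCORE threshold, and the MAX_RETRIES budget are all hypothetical stand-ins.

```python
# Hedged sketch of autoregressive video prediction with a "reflection"
# step, in the spirit of RoG. All classes, names, and thresholds here
# (VideoGenerator, VLMCritic, ACCEPT_SCORE, MAX_RETRIES) are
# hypothetical illustrations, not the paper's actual API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clip:
    """A short chunk of predicted video frames (placeholder)."""
    frames: List[str]  # stand-in for tensors of shape (T, H, W, C)

class VideoGenerator:
    """Stand-in for a pre-trained video generation model."""
    def predict(self, history: List[Clip], instruction: str) -> Clip:
        step = len(history)
        return Clip(frames=[f"frame_{step}_{i}" for i in range(4)])

class VLMCritic:
    """Stand-in for a vision-language model that scores a candidate
    clip against the instruction and suggests a refined prompt."""
    def reflect(self, clip: Clip, instruction: str) -> Tuple[float, str]:
        return 0.9, instruction  # dummy: always accept, keep prompt

ACCEPT_SCORE = 0.5  # hypothetical acceptance threshold
MAX_RETRIES = 2     # hypothetical per-chunk reflection budget

def rollout(gen: VideoGenerator, critic: VLMCritic,
            instruction: str, num_chunks: int) -> List[Clip]:
    """Autoregressively extend the video one chunk at a time,
    letting the critic vet each chunk before it is committed."""
    history: List[Clip] = []
    for _ in range(num_chunks):
        prompt = instruction
        for _ in range(MAX_RETRIES + 1):
            candidate = gen.predict(history, prompt)
            score, prompt = critic.reflect(candidate, instruction)
            if score >= ACCEPT_SCORE:
                break  # critic accepts; stop refining this chunk
        history.append(candidate)  # committed chunk conditions the next step
    return history

if __name__ == "__main__":
    clips = rollout(VideoGenerator(), VLMCritic(), "pick up the cup", 3)
    print([c.frames for c in clips])
```

The key design point the sketch tries to capture is that reflection happens between generation steps rather than after the full rollout, so errors are caught before they compound in later chunks.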
Lay Summary: Imagine a robot that can look at a video and predict what happens next — not just the next frame, but a whole sequence of actions. This kind of prediction is essential for intelligent machines to safely interact with the real world, whether it’s a kitchen robot or a self-driving car. But current video models often struggle when faced with complex tasks or unfamiliar situations. To solve this, we introduced a new method called Reflection of Generation (RoG). This approach helps the model “reflect” — that is, reason through intermediate steps — before making predictions. We combine strengths from two powerful technologies: video generation and vision-language models (like those used in image captioning). We also built EVA-Bench, a large-scale benchmark that helps evaluate how well these models work in both common and unusual environments. Our new model, EVA, uses this system to generate longer, more accurate video predictions. This work brings us closer to building machines that can understand and anticipate the physical world, with applications in robotics, virtual assistants, and immersive simulations.
Primary Area: Applications->Robotics
Keywords: World Model, Generation Model, Robotics
Submission Number: 2881