VideoAgent: Self-Improving Video Generation for Embodied Planning

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
Venue: RLBrew: Ingredients for Developing Generalist Agents workshop (RLC 2025)
License: CC BY 4.0
Keywords: planning, reinforcement learning, sequential decision making, video generation, self improvement
TL;DR: We propose VideoAgent to self-improve video generation by refining video plans using external feedback, significantly reducing hallucinations and enhancing task success in robotic manipulation tasks.
Abstract: Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up the dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. VideoAgent trains a video diffusion model to perform video refinement through a novel objective which we call self-conditioning consistency. During inference, VideoAgent samples and refines generated video plans under the guidance of a vision-language model (VLM) as a reward, enabling inference-time compute to be turned into better generated video plans. As the refined video plan is being executed, VideoAgent can collect additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting the success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robots can be an effective tool in grounding video generation in the physical world.
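The inference-time sample-and-refine procedure described in the abstract can be sketched as a simple loop: generate a candidate video plan, score it with a VLM-based reward, and keep refining while the reward improves. This is a minimal illustrative sketch, not the paper's implementation; `generate_plan`, `refine_plan`, and `vlm_reward` are hypothetical stand-ins for the video diffusion model, the self-conditioning-consistency refinement step, and the VLM reward.

```python
import random

def generate_plan(observation, instruction, rng):
    # Stand-in for the video diffusion model: a "plan" here is
    # just a list of dummy per-frame scores, not real video frames.
    return [rng.random() for _ in range(4)]

def refine_plan(plan, rng):
    # Stand-in for one refinement step (the paper trains this with
    # self-conditioning consistency): nudge each frame "cleaner".
    return [min(1.0, f + 0.1 * rng.random()) for f in plan]

def vlm_reward(plan, instruction):
    # Stand-in for the VLM judging how plausible / on-task the plan is.
    return sum(plan) / len(plan)

def refine_until_good(observation, instruction,
                      n_steps=5, threshold=0.9, seed=0):
    """Sample a plan, then spend inference-time compute refining it
    under the (hypothetical) VLM reward, keeping only improvements."""
    rng = random.Random(seed)
    best = generate_plan(observation, instruction, rng)
    best_r = vlm_reward(best, instruction)
    for _ in range(n_steps):
        if best_r >= threshold:
            break
        candidate = refine_plan(best, rng)
        r = vlm_reward(candidate, instruction)
        if r > best_r:  # accept a refinement only if the reward improves
            best, best_r = candidate, r
    return best, best_r
```

In the actual system the accepted plan would then be converted to robot controls and executed, and the resulting environment data fed back to further train the video model.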
Submission Number: 13