Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Published: 28 May 2025, Last Modified: 28 Jun 2025
Venue: FMEA @ CVPR 2025 (Oral)
License: CC BY 4.0
Keywords: Foundation Models based Robot Manipulation, Vision-based Robotics, Video Generation Models, 6D Pose Estimation
TL;DR: Our method enables robots to execute manipulation tasks purely from generated videos—no real‑world demos needed
Abstract: Can robots perform complex manipulation tasks—such as pouring, wiping, and mixing—purely by leveraging generative models, without requiring any physical demonstrations? Our key insight is that AI-generated videos, when combined with recent advances in computer vision, can serve as a rich and readily available source of supervision. We introduce Robots Imitating Generated Videos (RIGVid), a framework that enables robots to imitate manipulation behaviors from generated videos. Given a language command and an initial scene image, a video diffusion model generates a corresponding video. A 6D pose tracker then extracts object trajectories from the video, which are retargeted to the robot in an embodiment-agnostic fashion. To ensure quality, we propose an automatic filtering mechanism that discards inaccurately generated videos. Through extensive real-world evaluations, we show that filtered generated videos can be as effective as real demonstrations, and that performance improves with video quality. Our method outperforms state-of-the-art VLM-based open-world manipulation approaches by leveraging the rich, fine-grained visual details captured in generated videos, enabling more accurate and precise task execution. We also achieve stronger results than prior methods that use video generation for robotics. These findings suggest that generated videos alone can offer a scalable and effective source of supervision for robotic manipulation.
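Below is a minimal sketch of the pipeline described in the abstract. The interfaces (`generate_video`, `video_quality_score`, `track_object_pose`, the quality threshold, and the retargeting rule) are hypothetical placeholders, not the paper's actual components: generate a video from a command and an initial image, discard low-quality generations, track the object's 6D pose per frame, and replay the object's relative motion through the robot's end effector.

```python
import numpy as np

# Hypothetical stubs standing in for the components named in the abstract.
def generate_video(command: str, image: np.ndarray) -> list[np.ndarray]:
    """Video diffusion model: language command + initial scene image -> frames."""
    raise NotImplementedError

def video_quality_score(frames: list[np.ndarray]) -> float:
    """Automatic filtering: score how plausible the generated video is."""
    raise NotImplementedError

def track_object_pose(frames: list[np.ndarray]) -> list[np.ndarray]:
    """6D pose tracker: per-frame 4x4 object pose in the camera frame."""
    raise NotImplementedError

def retarget_to_robot(object_poses: list[np.ndarray],
                      grasp_pose: np.ndarray) -> list[np.ndarray]:
    """Embodiment-agnostic retargeting (assumed scheme): apply the object's
    relative motion to the end-effector pose at which the robot grasps it."""
    ref_inv = np.linalg.inv(object_poses[0])
    waypoints = []
    for pose in object_poses:
        delta = pose @ ref_inv          # object motion since the first frame
        waypoints.append(delta @ grasp_pose)
    return waypoints

def rigvid_sketch(command: str, image: np.ndarray, grasp_pose: np.ndarray,
                  quality_threshold: float = 0.5, max_attempts: int = 5):
    """Generate, filter, track, and retarget (illustrative loop only)."""
    for _ in range(max_attempts):
        frames = generate_video(command, image)
        if video_quality_score(frames) < quality_threshold:
            continue  # discard inaccurately generated videos
        object_poses = track_object_pose(frames)
        return retarget_to_robot(object_poses, grasp_pose)
    raise RuntimeError("No generated video passed the quality filter.")
```

The retargeting step above assumes the end effector moves rigidly with the grasped object, which is one way the same object trajectory could transfer across embodiments; the paper's actual retargeting may differ.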
Published paper: No
Submission Number: 25