ROSE: Reconstructing Objects, Scenes, and Trajectories from Casual Videos for Robotic Manipulation

Peihao Li; Haoran Geng; Jameson Crate; Yanbing Han; Junyi Zhang; Feishi Wang; Charlie Tianyue Cheng; Runpei Dong; Yen-Jen Wang; Haozhe Lou; Trevor Darrell; Pieter Abbeel; Jitendra Malik

Published: 17 Sept 2025, Last Modified: 17 Sept 2025

Venue: H2R CoRL 2025 Workshop

License: CC BY 4.0

Keywords: Real-to-sim-to-real, 3D Reconstruction, Manipulation

Abstract: We present ROSE, a real-to-sim-to-real (Real2Sim2Real) system for learning robot manipulation policies from casual human videos. ROSE reconstructs simulator-ready assets (objects, scenes, and object trajectories) directly from casual videos, and these assets are then used to train manipulation policies with reinforcement learning in simulation. Unlike existing real-to-sim pipelines that rely on specialized equipment or time-consuming, labor-intensive human annotation, our pipeline is equipment-agnostic and fully automated, which makes data collection easier to scale. From casual monocular videos, ROSE reconstructs metric-scale scenes, objects, and object trajectories in a single gravity-calibrated coordinate frame for robotic data collection in the simulator. With ROSE, we curate a dataset of simulator-ready scenes from casual videos, both our own captures and Internet footage, and create a benchmark for real-to-sim evaluation. Across a diverse suite of manipulation tasks, ROSE outperforms existing baselines, laying the groundwork for scalable robotic data collection and efficient Real2Sim2Real deployment.

Submission Number: 22
