RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Robotics, Video Generation, Imitation Learning, Data Augmentation
TL;DR: RoboTransfer is a video synthesis framework for robotic manipulation that ensures multi-view consistency while enabling fine-grained, disentangled control.
Abstract: The goal of general-purpose robotics is to create agents that can seamlessly adapt to and operate in diverse, unstructured human environments. Imitation learning has become a key paradigm for robotic manipulation, yet collecting large-scale and diverse demonstrations is prohibitively expensive. Simulators provide a cost-effective alternative, but the sim-to-real gap remains a major obstacle to scalability. We present RoboTransfer, a diffusion-based video generation framework for synthesizing robotic data. By leveraging cross-view feature interactions and globally consistent 3D geometry, RoboTransfer achieves multi-view geometric consistency while enabling fine-grained control over scene elements, including background editing and object replacement. Experiments show that RoboTransfer generates videos with improved geometric consistency and visual fidelity, and that policies trained on this data generalize better to novel, unseen scenarios. The code and datasets will be released upon acceptance.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6846