EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion, VLA
Abstract: Vision–language–action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility. Furthermore, we explore hybrid training with real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with generated data enable robots to generalize to unseen object categories and novel visual domains using only demonstrations from a single appearance. In real-world robotic manipulation tasks with zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and further improves by 13% with AdaMix, demonstrating its effectiveness in boosting policy generalization.
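The abstract describes AdaMix only as a hard-sample-aware strategy that dynamically reweights training batches; it does not say how the weights are computed. As a rough illustration of the general idea, and not the authors' method, the sketch below assumes a generic scheme in which each sample's weight is a softmax over per-sample losses, so harder (higher-loss) samples contribute more to the batch objective. The function names and the `temperature` parameter are hypothetical.

```python
import math


def hard_sample_weights(losses, temperature=1.0):
    """Softmax over per-sample losses: higher loss gives a larger weight.

    Assumption for illustration only; the abstract does not specify
    how AdaMix scores or reweights samples.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(losses, temperature=1.0):
    """Batch loss as a weighted sum that emphasizes hard samples."""
    weights = hard_sample_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))


# Two easy samples and one hard sample (made-up loss values).
losses = [0.2, 0.2, 1.5]
weights = hard_sample_weights(losses)
```

With this scheme, a larger `temperature` flattens the weights toward a plain average, while a smaller one concentrates the objective on the hardest samples in the batch.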
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9594