Keywords: In-Context Learning, Video Generation, Robotics
Abstract: Robots that can learn directly from human demonstration videos promise scalable cross-task and cross-environment generalization, yet existing approaches rely on intermediate representations such as keypoints or trajectories, which lose critical spatio-temporal detail and suffer from cumulative error. We introduce Mitty, a Diffusion Transformer framework that enables video In-Context Learning for end-to-end human-to-robot video generation. Mitty leverages the powerful visual and temporal priors of the pretrained Wan 2.2 video model, compressing human demonstration videos into condition tokens and fusing them with robot denoising tokens through bidirectional attention during diffusion. This design bypasses explicit action labels and intermediate representations, directly translating human actions into robot executions. We further mitigate data scarcity by synthesizing high-quality paired videos from large egocentric datasets. Experiments on the Human-to-Robot and EPIC-Kitchens datasets show that Mitty achieves state-of-the-art performance and strong generalization to unseen tasks and environments, offering new insights into scalable robot learning from human demonstrations.
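Below is a minimal PyTorch sketch of the conditioning mechanism the abstract describes: condition tokens from a human demonstration video are concatenated with noisy robot-video tokens and processed with bidirectional (full) attention. The class name `ConditionFusionBlock`, the token shapes, and the single-block structure are illustrative assumptions, not the paper's actual implementation or the Wan 2.2 architecture.

```python
# Hedged sketch: joint bidirectional attention over human-demo condition tokens
# and robot denoising tokens. Shapes, names, and block structure are assumed.
import torch
import torch.nn as nn


class ConditionFusionBlock(nn.Module):
    """Fuses condition tokens (human demo) with denoising tokens (robot video)
    via full self-attention over the concatenated token sequence."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, robot_tokens: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate so attention is bidirectional: every robot token can attend
        # to every condition token and vice versa.
        x = torch.cat([cond_tokens, robot_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Only the robot-token slice continues along the denoising path.
        return x[:, cond_tokens.shape[1]:]


if __name__ == "__main__":
    B, Nc, Nr, D = 2, 64, 128, 512        # batch, condition tokens, robot tokens, width
    cond = torch.randn(B, Nc, D)           # compressed human demonstration video
    noisy_robot = torch.randn(B, Nr, D)    # noised robot-video latents at a diffusion step
    out = ConditionFusionBlock(D)(noisy_robot, cond)
    print(out.shape)                       # torch.Size([2, 128, 512])
```

In this sketch the condition tokens act purely as in-context evidence: they are attended to but discarded at the block output, which mirrors the idea of conditioning the robot-video denoising process on the human demonstration without explicit action labels.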
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5223