Keywords: Imitation Learning, Manipulation, Representation Learning
TL;DR: Learn one-shot imitation policies by aligning task-equivalent human-robot video snippets using optimal transport.
Abstract: Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks.
However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods either depend on human-robot paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically aligns human and robot task executions using optimal transport costs. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving a more than 50% increase in task success compared to previous methods. We release our datasets and graphics at https://portal.cs.cornell.edu/rhyme/.
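To make the alignment idea concrete, below is a minimal Python sketch of scoring a robot video segment against a bank of short-horizon human clips with an entropic optimal transport cost over snippet embeddings, and retrieving the lowest-cost clip. The function names (sinkhorn_cost, retrieve_equivalent_clip), the cosine-distance cost, and the Sinkhorn-based scoring are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def sinkhorn_cost(X, Y, eps=0.05, n_iters=200):
    """Entropic OT cost between two sets of L2-normalized snippet embeddings.

    X: (n, d) embeddings of snippets from a robot video segment
    Y: (m, d) embeddings of snippets from a candidate human clip
    """
    # Cosine-distance cost matrix between snippet embeddings (assumed representation).
    C = 1.0 - X @ Y.T                            # (n, m)
    a = np.full(X.shape[0], 1.0 / X.shape[0])    # uniform mass over robot snippets
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])    # uniform mass over human snippets
    K = np.exp(-C / eps)                         # Gibbs kernel for entropic regularization
    u = np.ones_like(a)
    for _ in range(n_iters):                     # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)              # approximate transport plan
    return float((P * C).sum())                  # OT alignment cost

def retrieve_equivalent_clip(robot_segment_emb, human_clip_bank):
    """Return the index of the human clip with the lowest OT cost to the robot segment."""
    costs = [sinkhorn_cost(robot_segment_emb, clip_emb) for clip_emb in human_clip_bank]
    return int(np.argmin(costs)), costs

In this sketch, retrieved clips for consecutive robot segments would then be concatenated to form the semantically equivalent long-horizon human video used as the policy prompt during training.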
Previous Publication: No
Submission Number: 35