One-Shot Imitation under Mismatched Execution

Published: 07 May 2025 · Last Modified: 07 May 2025 · ICRA Workshop on Human-Centered Robot Learning · CC BY 4.0
Workshop Statement: Our work addresses a key challenge in human-centered robot learning: enabling robots to learn from unpaired human demonstration videos, despite differences in embodiment and execution style. We introduce RHyME, a framework for one-shot imitation that uses sequence-level retrieval to align unpaired human and robot trajectories without requiring matched demonstrations. This allows a robot to "imagine" how a human video maps to its own capabilities, enabling policy learning from weakly structured human data. This approach aligns closely with the workshop’s goals of scaling robot learning with large, unstructured datasets and grounding robot behavior in human-centered representations. By reasoning over entire demonstration sequences rather than frame-level visual similarity, RHyME facilitates robust imitation in real-world, long-horizon tasks.
Keywords: Imitation Learning, Cross-Embodiment Learning, Robot Manipulation
TL;DR: We enable one-shot robot imitation of human videos by retrieving task-aligned snippets from unpaired demonstrations, even under mismatched execution.
Abstract: Human demonstrations as prompts are a powerful way to program robots to perform long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving over a 50% increase in task success compared to previous methods. We release our code and datasets on this [website](https://portal-cornell.github.io/rhyme/).
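To make the retrieval idea in the abstract concrete, below is a minimal sketch of sequence-level matching with optimal transport: each robot snippet and candidate human clip is represented as a sequence of per-frame visual embeddings, a Sinkhorn-regularized OT cost compares the two sequences, and the lowest-cost human clips are composed into a synthetic demonstration. All function names, the embedding source, and the Sinkhorn hyperparameters are illustrative assumptions, not the released RHyME API.

```python
# Illustrative sketch of sequence-level retrieval with optimal transport (OT).
# Assumes each snippet is a (T, D) NumPy array of per-frame visual embeddings
# from some pretrained encoder; names and parameters are hypothetical.
import numpy as np

def sinkhorn_ot_cost(cost, reg=0.05, n_iters=100):
    """Entropic-regularized OT cost between two uniform distributions,
    given a pairwise cost matrix, via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    transport = np.diag(u) @ K @ np.diag(v)
    return float((transport * cost).sum())

def sequence_cost(robot_snippet, human_clip):
    """Cosine-distance matrix between frame embeddings, then the
    sequence-level OT cost between the two snippets."""
    r = robot_snippet / np.linalg.norm(robot_snippet, axis=1, keepdims=True)
    h = human_clip / np.linalg.norm(human_clip, axis=1, keepdims=True)
    cost = 1.0 - r @ h.T                 # (T_robot, T_human) distances
    return sinkhorn_ot_cost(cost)

def retrieve_human_clips(robot_snippets, human_clip_bank):
    """For each robot snippet, retrieve the human clip with the lowest OT
    cost; composing the retrieved clips yields a synthetic human video
    paired with the robot demonstration."""
    composed = []
    for snippet in robot_snippets:
        costs = [sequence_cost(snippet, clip) for clip in human_clip_bank]
        composed.append(human_clip_bank[int(np.argmin(costs))])
    return composed
```

Because the cost is computed over whole embedding sequences rather than individual frames, retrieval can tolerate differences in speed and style between the human and robot executions.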
Submission Number: 12
