Imitation Learning from a Single Temporally Misaligned Video

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY-SA 4.0
TL;DR: Learning sequential tasks from a temporally misaligned video requires a reward function that measures subgoal ordering and coverage.
Abstract: We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. The project website is at https://portal-cornell.github.io/orca/
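To make the idea of an ordered-coverage reward concrete, here is a minimal Python sketch. It is an illustration of the concept described in the abstract, not the authors' exact ORCA formulation: it assumes per-frame match probabilities `match_probs[t, k]` (e.g., derived from embedding similarity between agent and demonstration frames, a hypothetical choice) and tracks, via a simple dynamic program under an independence assumption, the probability that the agent has covered the first k demonstration frames in order by timestep t. The per-timestep reward is then taken to be the expected fraction of demonstration subgoals covered in order, which is dense and non-decreasing over time.

```python
import numpy as np

def ordered_coverage_reward(match_probs: np.ndarray) -> np.ndarray:
    """Illustrative ordered-coverage reward (a sketch, not the paper's exact formulation).

    Args:
        match_probs: (T, K) array; match_probs[t, k] is an assumed probability that
            agent frame t "covers" demonstration frame k (e.g., from embedding similarity).

    Returns:
        (T,) array of per-timestep rewards: the expected fraction of demonstration
        frames covered in the correct order by each timestep.
    """
    T, K = match_probs.shape
    # cover[k] = P(demo frames 0..k have all been covered, in order, by the current timestep)
    cover = np.zeros(K)
    rewards = np.zeros(T)
    for t in range(T):
        prev = cover.copy()
        # First subgoal: covered once any frame up to time t has matched it.
        cover[0] = prev[0] + (1.0 - prev[0]) * match_probs[t, 0]
        for k in range(1, K):
            # To newly cover subgoal k at time t: subgoal k-1 (but not yet k) was
            # already covered, and frame t matches subgoal k.
            newly_covered = (prev[k - 1] - prev[k]) * match_probs[t, k]
            cover[k] = prev[k] + newly_covered
        rewards[t] = cover.mean()  # expected fraction of subgoals covered in order
    return rewards

# Toy example: 3 demo subgoals, agent trajectory of 6 frames hitting them in order.
probs = np.array([
    [0.9, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 0.9],
])
print(ordered_coverage_reward(probs))  # coverage-based reward increases as subgoals are met in order
```

Note that an out-of-order match (e.g., hitting subgoal 3 before subgoal 2) contributes nothing in this sketch, which is the behavior that distinguishes sequence-level coverage from frame-level distribution matching.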
Lay Summary: Teaching robots new tasks usually requires detailed instructions about what actions are good at every moment, which we call “designing a reward function.” This is difficult and time consuming. An easier alternative is to show the robot a video demonstrating how to solve the task. However, these demonstrations often move at a different speed than the robot can move, which makes it difficult or even impossible for the robot to follow them exactly. We find, both in theory and practice, that traditional methods fail when the demonstration is at a different speed. Our solution is to treat the frames of the video as a sequence of subgoals that the robot must achieve at some point in time, instead of matching the timing exactly. Specifically, we define the reward function as how well the robot can match ALL the subgoals in the EXACT SAME order as the video. Then, we can teach the robot to try different actions and repeat actions that have high rewards, a technique known as reinforcement learning. Our work focuses on robot videos, but it sets the foundation for learning from human videos, which typically move at different speeds from robots.
Link To Code: https://github.com/portal-cornell/orca/
Primary Area: Reinforcement Learning->Inverse
Keywords: Learning from Videos, Inverse Reinforcement Learning, Reward Formulation
Submission Number: 13799