Contrast, Imitate, Adapt: Learning Robotic Skills From Raw Human Videos

Published: 01 Jan 2025, Last Modified: 11 Apr 2025. IEEE Trans. Autom. Sci. Eng. 2025. License: CC BY-SA 4.0.
Abstract: Learning robotic skills from raw human videos remains a non-trivial challenge. Previous works tackled this problem by leveraging behavior cloning or by learning reward functions from videos. Despite their remarkable performance, these approaches introduce several issues, such as the need for robot action labels, requirements for consistent viewpoints and similar layouts between human and robot videos, and low sample efficiency. To this end, our key insight is to learn task priors by contrasting videos, to learn action priors by imitating trajectories from videos, and to use the task priors to guide trajectories in adapting to novel scenarios. We propose a three-stage skill learning framework denoted Contrast-Imitate-Adapt (CIA). An interaction-aware alignment transformer (IAAformer) is proposed to learn task priors by temporally aligning video pairs. A trajectory generation model is then used to learn action priors. To adapt to novel scenarios that differ from the human videos, the Inversion-Interaction method is designed to initialize coarse trajectories and refine them through limited interaction. In addition, CIA introduces an optimization method based on semantic directions of trajectories for interaction safety and sample efficiency, in which the alignment distances computed by IAAformer are used as rewards. We evaluate CIA on six real-world everyday tasks and empirically demonstrate that CIA significantly outperforms previous state-of-the-art works in task success rate and in generalization to diverse novel scenario layouts and object instances.

Note to Practitioners: This work studies robot skill learning from raw human videos. Compared with teleoperation or kinesthetic teaching in the laboratory, such a learning method can flexibly utilize large-scale human videos available on the Internet, thereby improving the robot's ability to generalize to various complex scenarios.
Previous works on learning from videos usually suffer from issues including requirements for robot action labels, consistent viewpoints, and similar layouts, as well as low sample efficiency. To alleviate these issues, we propose the three-stage skill learning framework CIA. Temporal alignment is used to learn task priors through our proposed transformer-based model and self-supervised loss functions. A trajectory generation model is trained to learn action priors. To further adapt to diverse scenarios, we propose a two-stage policy improvement method based on initialization and interaction. An optimization method is introduced to ensure safe interaction and sample efficiency, with the optimization objective guided by the learned task priors. Experimental results show that CIA outperforms other state-of-the-art methods in task success rate and in generalization to novel scenarios.
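The idea of using alignment distances as rewards can be illustrated with a minimal sketch. Note the assumptions: the abstract does not specify the alignment algorithm, so plain dynamic time warping over per-frame embeddings stands in for the paper's IAAformer alignment, and the function name `alignment_reward` and the use of a generic frame encoder output are hypothetical.

```python
import numpy as np

def alignment_reward(human_emb, robot_emb):
    """Hypothetical sketch: a dense reward from temporal alignment distance.

    human_emb, robot_emb: (T, D) arrays of per-frame embeddings (in CIA these
    would come from the trained alignment model; here any frame encoder is
    assumed). Classic DTW is used as a stand-in alignment: a lower alignment
    distance between the robot rollout and the human demonstration yields a
    higher reward.
    """
    T1, T2 = len(human_emb), len(robot_emb)
    # Pairwise L2 distances between frames of the two videos.
    cost = np.linalg.norm(human_emb[:, None, :] - robot_emb[None, :, :], axis=-1)
    # Standard DTW accumulation table.
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    distance = acc[T1, T2] / (T1 + T2)  # length-normalized alignment distance
    return -distance  # reward is the negative alignment distance
```

A rollout identical to the demonstration scores reward 0, and the reward decreases as the robot's embedded trajectory drifts from the human's, which is the property an alignment-based reward needs for policy improvement.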