Keywords: imitation learning, robotics, manipulation, learning from human demonstrations, learning from observations, generalization, visuomotor control
TL;DR: We leverage hand-centric human video demonstrations to learn generalizable robotic manipulation policies via imitation learning, introducing a simple framework that sidesteps explicit human-robot domain adaptation.
Abstract: Videos of humans performing tasks are a promising data source for robotic manipulation because they are easy to collect in a wide range of scenarios and thus have the potential to significantly expand the generalization capabilities of vision-based robotic manipulators. Prior approaches to learning from human video demonstrations typically use third-person or egocentric data, but a central challenge in these settings is the domain shift caused by the difference in appearance between human and robot morphologies. In this work, we largely reduce this domain gap by collecting hand-centric human video data (i.e., videos captured by a human demonstrator wearing a camera on their arm). To further close the gap, we simply crop out a portion of every visual observation such that the hand is no longer visible. We propose a framework for broadening the generalization of deep robotic imitation learning policies by incorporating unlabeled data in this format, without needing to employ any domain adaptation method, as the human embodiment is not visible in the frame. On a suite of six real robot manipulation tasks, our method substantially improves the generalization performance of manipulation policies acting on hand-centric image observations. Moreover, our method enables robots to generalize to both new environment configurations and new tasks that are unseen in the expert robot imitation data.
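To illustrate the cropping step described in the abstract, here is a minimal Python sketch (not the authors' code) of masking out the embodiment from a hand-centric (wrist-mounted) camera frame by discarding a fixed portion of the image; the crop fraction and the assumption that the hand enters from the bottom of the frame are illustrative only.

```python
# Hypothetical sketch: crop a hand-centric observation so the embodiment
# (human hand or robot gripper) is no longer visible. The keep_fraction
# value and bottom-of-frame assumption are illustrative, not from the paper.
import numpy as np

def crop_out_hand(frame: np.ndarray, keep_fraction: float = 0.6) -> np.ndarray:
    """Keep only the top `keep_fraction` of image rows, removing the region
    where the demonstrator's hand (or robot gripper) would appear."""
    height = frame.shape[0]
    keep_rows = int(height * keep_fraction)
    return frame[:keep_rows, :, :]

# Example usage with a dummy 224x224 RGB observation
obs = np.zeros((224, 224, 3), dtype=np.uint8)
cropped = crop_out_hand(obs)  # shape becomes (134, 224, 3)
```

Because the same crop is applied to both human and robot observations, the resulting images look alike across embodiments, which is why no explicit domain adaptation is needed.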