Keywords: Object-centric Imitation Learning, Keypoint Representations, Robotic Manipulation
TL;DR: We investigate how semantic keypoints can be used to improve the generalization of imitation learning for robotic manipulation.
Abstract: RGB-based imitation learning requires many demonstrations to generalize to unseen objects or scenes, motivating research into intermediate representations to improve generalization for robotic manipulation. Vision foundation models enable one-shot extraction of keypoints to provide such a representation. However, how to optimally integrate keypoints into imitation learning and when they outperform alternative representations remain unclear. We systematically study design choices in keypoint imitation learning (KIL), thereby consolidating insights from prior work into practical guidelines. Evaluated in over 2000 real-world rollouts across five tasks and diverse scene variations, KIL achieves a 75% overall success rate, substantially outperforming an RGB baseline (47%) and performing similarly to S²-diffusion (73%), an object-centric baseline. Finally, we explore the limitations of the foundation models used for keypoint extraction and find that they are sensitive to large variations in object orientation. Our results confirm KIL as a data-efficient approach for robot learning and suggest directions for future research to improve our understanding of its limitations and potential. Videos of all 2000 rollouts are available at https://kil-manipulation.github.io/
Submission Number: 21