Keywords: Imitation Learning, 3D Vision, Long-horizon tasks
Abstract: Learning long-horizon manipulation skills from high-dimensional demonstrations is a critical but challenging task for robots today. Recently, several methods have been proposed to learn such skills by extracting keyframes from the demonstration, learning to predict the robot gripper's pose at the next keyframe, and using motion planning or a learned low-level policy to transition between keyframes. However, these methods can suffer from imprecision and sensitivity to initial conditions. Concurrently, several promising algorithms have been proposed to efficiently learn precise object-object relationships from a small number of demonstrations, enabling generalization to new objects and new configurations. In this work, we introduce a new framework that adapts these object-object relational methods for long-horizon imitation learning. By predicting a desired object-object relationship at each keyframe, rather than only a gripper pose, our method learns precise sequences of object rearrangements that can be executed by a robot. We demonstrate this technique on the RLBench10 suite of tasks, where it achieves state-of-the-art success rates. Supplementary materials are available on our [website](https://sites.google.com/view/taxpolicy-corl-2024/home).
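To make the core idea concrete, here is a minimal sketch (not the paper's actual method) of how a predicted object-object relationship at a keyframe could be composed into a gripper goal pose for a motion planner. All function names and the frame conventions are illustrative assumptions; poses are 4x4 homogeneous transforms.

```python
import numpy as np

def relative_pose(T_world_a: np.ndarray, T_world_b: np.ndarray) -> np.ndarray:
    """Pose of frame b expressed in frame a (both given in the world frame)."""
    return np.linalg.inv(T_world_a) @ T_world_b

def gripper_goal_from_relationship(T_world_anchor: np.ndarray,
                                   T_anchor_grasped_desired: np.ndarray,
                                   T_grasped_gripper: np.ndarray) -> np.ndarray:
    """Hypothetical helper: turn a desired object-object relationship into a gripper goal.

    If the grasped object should reach pose T_anchor_grasped_desired relative to an
    anchor object, and the gripper is rigidly attached to the grasped object by
    T_grasped_gripper, the world-frame gripper goal follows by composition. A motion
    planner (not shown) would then move the arm to this goal between keyframes.
    """
    T_world_grasped_desired = T_world_anchor @ T_anchor_grasped_desired
    return T_world_grasped_desired @ T_grasped_gripper

if __name__ == "__main__":
    # Toy example: anchor object at the origin, desired relationship places the grasped
    # object 10 cm along the anchor's x-axis, gripper offset 5 cm along the object's z-axis.
    T_world_anchor = np.eye(4)
    T_anchor_grasped_desired = np.eye(4)
    T_anchor_grasped_desired[0, 3] = 0.10
    T_grasped_gripper = np.eye(4)
    T_grasped_gripper[2, 3] = 0.05
    goal = gripper_goal_from_relationship(T_world_anchor,
                                          T_anchor_grasped_desired,
                                          T_grasped_gripper)
    print(goal[:3, 3])  # -> [0.1  0.   0.05]
```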
Submission Number: 10