Keywords: imitation learning, segmentation, voxels
TL;DR: We improve the performance of a text-conditioned imitation learning algorithm that uses voxel representations by adding information from pretrained segmentation and text-image models.
Abstract: Prior work has shown the benefits of using a 3D representation space, in particular voxels, for 3D manipulation tasks. However, computation with voxels requires $N^3$ memory, which limits the feasible observation size. While the structure of voxels conveys spatial information, limited resolution can obscure semantically relevant information. In this work, we show that this limitation can be overcome by conditioning a 3D-based agent, Perceiver-Actor, on additional segmentation information, which allows it to successfully distinguish between similar objects in manipulation tasks. This is achieved by using pretrained segmentation and text-image models to extract segmentation masks for relevant objects in a zero-shot manner. We demonstrate our model on a real robot, where we show it can correctly interact with objects that have fine-grained differences, such as a "Cola" can versus a "Dr. Pepper" can.
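As a rough illustration of the two ideas in the abstract (cubic memory growth, and a segmentation mask carried alongside the voxel features), the NumPy sketch below scatters a point cloud into an N x N x N grid with one extra mask channel. It is a minimal sketch under stated assumptions, not the Perceiver-Actor implementation: the function names `voxel_grid_bytes` and `voxelize_with_mask`, the five-channel layout, and the assumption that a per-point binary mask has already been produced by the pretrained models are all hypothetical.

```python
import numpy as np


def voxel_grid_bytes(n, channels, bytes_per_value=4):
    """Memory of a dense n x n x n grid; grows with n**3 (hypothetical helper)."""
    return n ** 3 * channels * bytes_per_value


def voxelize_with_mask(points, colors, mask, bounds_min, bounds_max, n):
    """Scatter points into an (n, n, n, 5) grid (assumed layout):
    channels 0-2 mean RGB, channel 3 mean mask value, channel 4 occupancy."""
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)

    # Per-point features: RGB plus the segmentation mask value.
    feats = np.concatenate([colors, mask[:, None]], axis=1)

    # Drop points that fall outside the workspace bounds.
    inside = np.all((points >= bounds_min) & (points < bounds_max), axis=1)
    points, feats = points[inside], feats[inside]

    # Map each point to an integer voxel index, then to a flat index.
    scale = n / (bounds_max - bounds_min)
    idx = np.floor((points - bounds_min) * scale).astype(np.int64)
    idx = np.clip(idx, 0, n - 1)
    flat = np.ravel_multi_index(tuple(idx.T), (n, n, n))

    # Accumulate feature sums and point counts per voxel.
    sums = np.zeros((n ** 3, 4))
    counts = np.zeros(n ** 3)
    np.add.at(sums, flat, feats)
    np.add.at(counts, flat, 1.0)

    # Average features in occupied voxels; empty voxels stay zero.
    occupied = counts > 0
    grid = np.zeros((n ** 3, 5), dtype=np.float32)
    grid[occupied, :4] = sums[occupied] / counts[occupied, None]
    grid[:, 4] = occupied
    return grid.reshape(n, n, n, 5)
```

In this sketch, doubling `n` multiplies the grid memory by eight, which is why resolution is limited; the mask channel is one way to preserve which voxels belong to the target object even when two similar objects look alike at that resolution.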