Fine-Grained Object Detection and Manipulation with Segmentation-Conditioned Perceiver-Actor

Shogo Akiyama; Dan Ogawa Lillrank; Kai Arulkumaran

Published: 07 May 2023, Last Modified: 11 May 2023
Venue: ICRA-23 Workshop on Pretraining4Robotics Lightning
Readers: Everyone

Keywords: imitation learning, segmentation, voxels

TL;DR: We improve the performance of a text-conditioned imitation learning algorithm with voxel representations by using pretrained segmentation and text-image models.

Abstract: Prior work has shown the benefits of using a 3D representation space, in particular voxels, for 3D manipulation tasks. However, computation with voxels requires $N^3$ memory, which limits the possible observation size. While the structure of voxels conveys spatial information, limited resolution can obscure semantically relevant information. In this work, we show that this limitation can be overcome by conditioning a 3D-based agent, Perceiver-Actor, on additional segmentation information, which allows it to successfully distinguish between similar objects in manipulation tasks. This is achieved by using pretrained segmentation and text-image models to extract segmentation masks for relevant objects in a zero-shot manner. We demonstrate our model on a real robot, where we show it can correctly interact with objects with fine-grained differences, such as a "Cola" can versus a "Dr. Pepper" can.
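
To make the $N^3$ memory cost concrete, the following is a minimal sketch of how a dense voxel grid's footprint grows with its side length. It assumes $N$ is the number of voxels per axis; the function name `voxel_grid_bytes`, the channel count of 10, and the 4-byte (float32) values are illustrative assumptions, not figures from the paper.

```python
def voxel_grid_bytes(n: int, channels: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed for a dense n x n x n grid storing `channels` values per voxel.

    Hypothetical helper: channels and bytes_per_value are assumed for illustration.
    """
    return n ** 3 * channels * bytes_per_value


# Print the footprint for a few side lengths (10 feature channels assumed).
for n in (50, 100, 200):
    mib = voxel_grid_bytes(n, channels=10) / 2**20
    print(f"N={n}: {mib:.1f} MiB")
```

Under these assumptions, doubling the side length multiplies the memory by eight, which is why the resolution of a voxel observation cannot simply be raised until fine visual details become distinguishable.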
