CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning

Published: 17 Nov 2022, Last Modified: 05 May 2023
PRL 2022 Poster
Readers: Everyone
Keywords: Imitation Learning, Representation learning, Vision-based robot manipulation
TL;DR: We propose CACTI, a framework for scalable multi-task, multi-scene visual imitation learning. CACTI leverages pretrained visual representations and generative models to compress and augment the visual observations.
Abstract: Developing robots that possess a diverse repertoire of behaviors and generalize to unknown scenarios requires progress on two fronts: efficient collection of large-scale and diverse datasets, and training of high-capacity policies on the collected data. While large and diverse datasets unlock generalization capabilities, like those observed in computer vision and natural language processing, collecting such datasets is particularly challenging for physical systems such as robots. In this work, we propose a framework to bridge this gap and scale robot learning, through the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. We demonstrate that, in a simulated kitchen environment, CACTI enables training a single policy on 18 semantic tasks across up to 50 layout variations per task. When instantiated on a real robot setup, CACTI yields a policy that can perform 5 manipulation tasks involving kitchen objects and is robust to varying distractor layouts. The simulation task benchmark and augmented datasets in both real and simulated environments will be released to facilitate future research.