Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Published: 24 Apr 2024, Last Modified: 15 May 2024
ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation
License: CC BY 4.0
Keywords: Foundation Models, Language-conditioned Imitation Learning, Zero-shot Learning
TL;DR: A novel framework to label uncurated, long-horizon robot demonstrations without any model training or human annotation, for language-conditioned policy learning.
Abstract: A central challenge in developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, hindering their scalability. To this end, we introduce a novel approach to automatically label uncurated, long-horizon robot teleoperation data at scale in a zero-shot manner, without any human intervention. We use a combination of pre-trained vision-language foundation models to detect objects in a scene, propose possible tasks, and segment tasks from large datasets of unlabeled interaction data, and then train language-conditioned policies on the relabeled datasets. Our initial experiments show that our method enables training language-conditioned policies on unlabeled and unstructured datasets that match the performance of policies trained with oracle human annotations.
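The pipeline the abstract describes (detect objects, propose candidate tasks, segment the long-horizon demonstration, then emit instruction-labeled segments for policy training) can be pictured roughly as in the sketch below. This is a hypothetical illustration, not the authors' implementation: every function here is an assumed placeholder for the pre-trained vision-language foundation models the abstract mentions, and the exact models, prompts, and segmentation criteria are not specified in this abstract.

```python
"""Illustrative sketch (assumptions, not the authors' code) of a zero-shot
labeling pipeline: detect objects, propose tasks with a foundation model,
segment the unlabeled trajectory, and return (segment, instruction) pairs
usable for language-conditioned imitation learning."""

from dataclasses import dataclass


@dataclass
class LabeledSegment:
    start_frame: int   # first frame index of the segmented sub-task
    end_frame: int     # last frame index of the segmented sub-task
    instruction: str   # natural-language instruction proposed for it


def detect_objects(frame) -> list[str]:
    # Placeholder for an open-vocabulary object detector (a pre-trained
    # vision-language model would be queried here).
    return ["red block", "drawer"]


def propose_tasks(objects: list[str]) -> list[str]:
    # Placeholder for a language model prompted with the detected objects
    # to generate plausible manipulation instructions.
    return [f"pick up the {obj}" for obj in objects]


def task_completed(task: str, frame) -> bool:
    # Placeholder for a vision-language success detector that judges
    # whether the given instruction has been accomplished in this frame.
    return False


def label_demonstration(frames) -> list[LabeledSegment]:
    """Scan an unlabeled long-horizon trajectory, closing a segment
    whenever one of the proposed tasks is judged complete, then
    re-proposing tasks for the remainder of the trajectory."""
    labels: list[LabeledSegment] = []
    segment_start = 0
    candidates = propose_tasks(detect_objects(frames[0]))
    for t, frame in enumerate(frames):
        for task in candidates:
            if task_completed(task, frame):
                labels.append(LabeledSegment(segment_start, t, task))
                segment_start = t + 1
                candidates = propose_tasks(detect_objects(frame))
                break
    return labels
```

The resulting (segment, instruction) pairs would then serve as training data for a language-conditioned policy in place of human annotations; how the real system scores task completion and handles overlapping or failed segments is left open by the abstract.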
Submission Number: 12