Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Published: 16 Apr 2024, Last Modified: 02 May 2024 · MoMa WS 2024 Oral · CC BY 4.0
Keywords: LLM, Foundation Models, Policy Learning, Zero-Shot, Vision-Language Models, Play Data
TL;DR: A novel method to label long-horizon robot play data zero-shot using pretrained frozen foundation models to train language-conditioned policies without requiring any human annotation.
Abstract: A central challenge in developing robots that can relate human language to their perception and actions is the scarcity of natural-language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, which hinders their scalability. To this end, we introduce a novel approach to automatically label uncurated, long-horizon robot teleoperation data at scale in a zero-shot manner, without any human intervention. We use a combination of pretrained vision-language foundation models to detect objects in a scene, propose possible tasks, and segment tasks from large datasets of unlabeled interaction data, and we then train language-conditioned policies on the relabeled datasets. Our initial experiments show that our method enables training language-conditioned policies on unlabeled and unstructured datasets that match those trained with oracle human annotations.
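The pipeline in the abstract (detect objects, propose candidate tasks, segment and label sub-trajectories) can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: all function names, the data format, and the toy word-overlap scorer are hypothetical stand-ins for the pretrained vision-language models the paper actually uses.

```python
def detect_objects(frame):
    """Stand-in for an open-vocabulary detector (e.g. a frozen VLM) run on a frame."""
    # A real system would query a pretrained vision-language model here;
    # this toy frame already carries its object list.
    return frame["visible_objects"]

def propose_tasks(objects):
    """Stand-in for a foundation model proposing candidate tasks from detected objects."""
    return ([f"pick up the {obj}" for obj in objects]
            + [f"push the {obj}" for obj in objects])

def segment_and_label(episode, task_scorer):
    """Attach the best-matching natural-language instruction to each
    sub-trajectory of a long-horizon play episode."""
    labeled = []
    for segment in episode["segments"]:
        objects = detect_objects(segment["frames"][0])
        candidates = propose_tasks(objects)
        # Score each candidate instruction against the segment (a real system
        # might use video-language similarity) and keep the best one.
        best = max(candidates, key=lambda task: task_scorer(segment, task))
        labeled.append({"segment": segment, "instruction": best})
    return labeled

# Toy episode and scorer, for illustration only.
episode = {
    "segments": [
        {"frames": [{"visible_objects": ["red block", "drawer"]}],
         "reference": "pick up the red block"},
    ]
}

def toy_scorer(segment, task):
    # Toy similarity: word overlap with a reference string standing in for
    # a learned video-language score.
    return len(set(task.split()) & set(segment["reference"].split()))

labeled = segment_and_label(episode, toy_scorer)
print(labeled[0]["instruction"])  # → pick up the red block
```

The relabeled `(segment, instruction)` pairs would then serve as training data for a language-conditioned policy, replacing human annotations.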
Submission Number: 20