Training robots to perform a huge range of tasks across many different environments is immensely difficult. Instead, we propose training robots selectively, based on end-user preferences.
Given a vision- and language-conditioned factory model that lets an end user instruct a robot to perform lower-level actions (e.g., ‘Move left’), we show that end users can collect demonstrations using language to train their home model on higher-level tasks specific to their needs (e.g., ‘Open the top drawer and put the block inside’). Our method improves task success rates by 13% over a baseline method.
We also explore using a large vision-language model (VLM), Bard, to automatically decompose tasks into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to decompose tasks down to our lowest level, but it achieves good results decomposing high-level tasks into mid-level skills. A supplemental video and additional results are available at talk-through-it.github.io.