Keywords: Robotic manipulation, natural language, behavior cloning, VLM, end user
TL;DR: We present a framework that allows end users to instruct robots to complete tasks, and save the tasks as demonstrations for behavior cloning.
Abstract: Training robots to perform a wide range of tasks in many different environments is immensely difficult. Instead, we propose selectively training robots based on end-user preferences. Given a vision- and language-conditioned factory model that lets an end user instruct a robot to perform lower-level actions (e.g. ‘Move left’), we show that end users can collect demonstrations using language to train their home model for higher-level tasks specific to their needs (e.g. ‘Open the top drawer and put the block inside’). Our method results in a 13% improvement in task success rates compared to a baseline method.
We also explore the use of a large vision-language model (VLM), Bard, to automatically break tasks down into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to break tasks down to our lowest level, but it does achieve good results breaking high-level tasks into mid-level skills. We provide a supplemental video and additional results at talk-through-it.github.io.
Submission Number: 34