Talk Through It: End User Directed Manipulation Learning

Published: 05 Apr 2024 · Last Modified: 15 Jul 2024 · VLMNM 2024 · CC BY 4.0
Keywords: Robotic manipulation, natural language, behavior cloning, VLM, end user
TL;DR: We present a framework that lets end users instruct robots to complete tasks and save those tasks as demonstrations for behavior cloning.
Abstract: Training robots to perform a huge range of tasks in many different environments is immensely difficult. Instead, we propose selectively training robots based on end-user preferences. Given a vision- and language-conditioned factory model that lets an end user instruct a robot to perform lower-level actions (e.g. ‘Move left’), we show that end users can collect demonstrations using language to train their home model for higher-level tasks specific to their needs (e.g. ‘Open the top drawer and put the block inside’). Our method results in a 13% improvement in task success rates over a baseline method. We also explore the use of a large vision-language model (VLM), Bard, to automatically break down tasks into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to break tasks down to our lowest level, but does achieve good results breaking high-level tasks into mid-level skills. We have a supplemental video and additional results at talk-through-it.github.io.
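To make the VLM decomposition idea concrete, below is a minimal Python sketch of prompting a model to rewrite a high-level task as a sequence of mid-level skills. The prompt wording, the skill vocabulary, and the `query_vlm` call are illustrative assumptions, not the paper's actual interface or the Bard API.

```python
# Minimal sketch (not the paper's code): decomposing a high-level task into
# mid-level skill instructions with a vision-language model. Skill names,
# prompt text, and `query_vlm` are hypothetical placeholders.

from typing import List

# Hypothetical mid-level skill vocabulary the VLM is asked to use.
MID_LEVEL_SKILLS = [
    "grasp the drawer handle",
    "pull the drawer open",
    "pick up the block",
    "place the block in the drawer",
]

def build_decomposition_prompt(task: str) -> str:
    """Ask the VLM to rewrite a high-level task as a list of mid-level skills."""
    skill_list = "\n".join(f"- {s}" for s in MID_LEVEL_SKILLS)
    return (
        f"Break the task '{task}' into an ordered list of steps, "
        f"using only these mid-level skills:\n{skill_list}\n"
        "Answer with one skill per line."
    )

def parse_skill_sequence(response: str) -> List[str]:
    """Turn the VLM's line-separated answer into a clean list of skill strings."""
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

if __name__ == "__main__":
    prompt = build_decomposition_prompt("Open the top drawer and put the block inside")
    # response = query_vlm(prompt, scene_image)  # hypothetical VLM call
    response = (
        "- grasp the drawer handle\n- pull the drawer open\n"
        "- pick up the block\n- place the block in the drawer"
    )
    print(parse_skill_sequence(response))
```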
Submission Number: 34