Training robots to perform a huge range of tasks across many different environments is immensely difficult. Instead, we propose training robots selectively, based on end-user preferences.
Given a vision- and language-conditioned factory model that lets an end user instruct a robot to perform lower-level actions (e.g., ‘Move left’), we show that end users can collect demonstrations using language to train their home model on higher-level tasks specific to their needs (e.g., ‘Open the top drawer and put the block inside’). Our method improves task success rates by 13% over a baseline method.
We also explore using a large vision-language model (VLM), Bard, to automatically decompose tasks into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to decompose tasks down to our lowest level, but it achieves good results decomposing high-level tasks into mid-level skills. A supplemental video and additional results are available at talk-through-it.github.io.