R+X: Retrieval and Execution from Everyday Human Videos

Published: 26 Jun 2024, Last Modified: 09 Jul 2024, DGR@RSS2024 Poster, CC BY 4.0
Keywords: learning from observation, foundation models
TL;DR: We learn skills from long, unlabelled, first-person videos, using foundation models both to retrieve demonstrations and to compute actions.
Abstract: We present R+X, a framework that enables robots to learn skills from long, unlabelled first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then conditions an in-context imitation learning technique on this behaviour to execute the skill. Because it uses a vision-language model (VLM) for retrieval, R+X requires no manual annotation of the videos, and because it uses in-context learning for execution, the robot can perform commanded skills immediately, without any training on the retrieved videos. Experiments on a range of everyday household tasks show that R+X successfully translates unlabelled human videos into robust robot skills, and that it outperforms several recent alternative methods. Videos are available at https://sites.google.com/view/r-plus-x.
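To make the retrieve-then-execute pipeline described in the abstract concrete, below is a minimal Python sketch of the two stages: a retrieval step that stands in for VLM-based clip selection given a language command, followed by an in-context policy that conditions on the retrieved clips at execution time without any further training. All names, signatures, and placeholder logic here (retrieve_clips, in_context_policy, fixed-length windows, the dummy action) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage R+X flow: (1) retrieve relevant clips
# from a long unlabelled first-person video, (2) condition an in-context
# imitation policy on those clips to act on the commanded skill.
# Every name and the stub logic below are assumptions for illustration only.

from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    """A short segment of the long first-person video (frame indices only here)."""
    start_frame: int
    end_frame: int


def retrieve_clips(video_frames: Sequence, command: str, top_k: int = 3) -> List[Clip]:
    """Stand-in for VLM-based retrieval: propose candidate segments and return
    the top-k matches for the language command. A real system would rank the
    candidates with a VLM; this placeholder simply takes the first few
    fixed-length windows."""
    window = 30  # assumed clip length in frames
    candidates = [
        Clip(start, min(start + window, len(video_frames)))
        for start in range(0, len(video_frames), window)
    ]
    return candidates[:top_k]


def in_context_policy(clips: List[Clip], observation) -> List[float]:
    """Stand-in for the in-context imitation learner: given retrieved clips as
    context and the current observation, output an action with no gradient-based
    training. Returns a dummy zero action here."""
    return [0.0, 0.0, 0.0]


if __name__ == "__main__":
    long_video = list(range(300))      # placeholder for hours of egocentric frames
    command = "open the drawer"        # language command from the human
    context_clips = retrieve_clips(long_video, command)
    action = in_context_policy(context_clips, observation=None)
    print(f"Retrieved {len(context_clips)} clips; first action: {action}")
```

The point of the sketch is the division of labour: retrieval removes the need for manual video annotation, and in-context execution removes the need for a training phase on the retrieved clips.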
Submission Number: 23