Keywords: generalization, robotic manipulation, language augmentation
TL;DR: We present STEER, a framework that extracts flexible low-level skills from existing datasets; these skills can be combined by humans or VLMs to handle more complex situations without any additional data collection or fine-tuning.
Abstract: Recent advances have showcased the potential of leveraging the broad semantic understanding learned by vision-language models (VLMs) in robot learning; however, connecting VLMs effectively to robot control remains an open question, since physical robot data is relatively sparse and narrow compared to internet-scale VLM training data.
We propose STEER, a system for bridging this gap by learning flexible, low-level manipulation skills that can be modulated or repurposed to adapt to new situations. We show that training low-level policies on structured, dense re-annotations of existing robot datasets exposes an intuitive and flexible interface through which humans or VLMs can guide them in unfamiliar scenarios or perform new tasks using common-sense reasoning. We demonstrate that the skills learned via STEER can be combined to synthesize novel behaviors and achieve held-out tasks without additional training. Videos at https://steer-anon.github.io/
Submission Number: 37