LIV: Language-Image Representations and Rewards for Robotic Control

14 Apr 2023 (modified: 22 Oct 2023) · Submitted to ICRA-23 Workshop on Pretraining4Robotics · Readers: Everyone
Keywords: Pre-Training, Fine-Tuning, Reward Learning, Language, Multi-Modal Representation
Abstract: Motivated by the growing research in natural language-based task interfaces for robotic tasks, we seek good vision-language representations specialized for control. We posit that such representations should: (1) align the two modalities to permit grounding language-based task specifications in visual state-based task rewards, (2) capture sequentiality and task-directed progress in conjunction with cross-modality alignment, and (3) permit extensive pre-training from large generic datasets as well as fine-tuning on small in-domain datasets. We achieve these desiderata through Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen with no action information. Then, with access to target domain data, the very same objective consistently improves this pre-trained LIV model as well as other pre-existing vision-language representations for language-conditioned control. On two simulated robot domains that evaluate vision-language representations and rewards, LIV pre-trained and fine-tuned models consistently outperform the best prior approaches, establishing the advantages of joint vision-language representation and reward learning within a unified, compact framework.
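
To make the idea concrete, here is a minimal sketch of how a vision-language value model of this kind could be used as a language-conditioned reward: task progress is scored as the change in image-text embedding similarity between consecutive frames. All names below (VisionEncoder, TextEncoder, language_reward) are hypothetical placeholders standing in for pre-trained encoders; this is an illustrative sketch of the general recipe, not the authors' released implementation or API.

```python
# Hypothetical sketch: language-conditioned reward from image-text embedding similarity.
# The encoders below are toy placeholders for pre-trained vision-language models.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionEncoder(nn.Module):
    """Placeholder image encoder mapping frames into a shared embedding space."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


class TextEncoder(nn.Module):
    """Placeholder text encoder (mean of token embeddings) for task descriptions."""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        return F.normalize(self.proj(self.tok(token_ids).mean(dim=1)), dim=-1)


def language_reward(vision_enc, text_enc, frame_t, frame_t1, task_tokens):
    """Shaped reward: increase in image-text similarity (a proxy for task progress)
    between consecutive frames, using cosine similarity as the value estimate."""
    with torch.no_grad():
        g = text_enc(task_tokens)                 # goal (language) embedding
        v_t = (vision_enc(frame_t) * g).sum(-1)   # value of current frame
        v_t1 = (vision_enc(frame_t1) * g).sum(-1) # value of next frame
    return v_t1 - v_t                             # progress-based reward


if __name__ == "__main__":
    venc, tenc = VisionEncoder(), TextEncoder()
    frame_t = torch.rand(2, 3, 64, 64)
    frame_t1 = torch.rand(2, 3, 64, 64)
    tokens = torch.randint(0, 1000, (2, 6))       # e.g. tokenized "open the drawer"
    print(language_reward(venc, tenc, frame_t, frame_t1, tokens))
```

In this sketch the reward is the difference of consecutive similarity scores rather than the raw similarity, so that the agent is rewarded for making progress toward the language-specified goal instead of merely sitting in visually similar states.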