Vision-Language Models Provide Promptable Representations for Reinforcement Learning

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning, Promptable Representations, Vision-Language Models, Embodied Control
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We use the prior knowledge in vision-language models for reinforcement learning by extracting promptable representations: grounded semantic features useful for control, elicited by prompting the model with task context and auxiliary information.
Abstract: Intelligent beings can quickly learn new behaviors and tasks by leveraging background world knowledge. This stands in contrast to most agents trained with reinforcement learning (RL), which typically learn behaviors from scratch. We would therefore like to endow RL agents with a similar ability to leverage contextual prior information. To this end, we propose a novel approach that uses the vast general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs), which are pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using them as sources of \textit{promptable representations}: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex RL tasks in Minecraft. We find that policies trained on promptable representations significantly outperform equivalent policies trained on generic, non-promptable image encoder features. Moreover, we show that promptable representations extracted from general-purpose VLMs outperform both domain-specific representations and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important for yielding good representations for RL. Finally, we give a simple method for evaluating and optimizing the prompts used by our approach for a given task without running expensive RL trials, ensuring that the chosen prompts elicit task-relevant semantic features from the VLM.
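To make the described pipeline concrete, below is a minimal PyTorch sketch (not the paper's implementation) of the promptable-representation idea: a frozen VLM is prompted with a visual observation plus task-context text, an embedding is taken from its internal states rather than its generated text, and a small policy head trained with RL maps that embedding to action logits. The names `StubVLM`, `promptable_embedding`, and `PolicyHead` are illustrative assumptions standing in for a real pretrained VLM interface, not the authors' API.

```python
# Minimal sketch (assumptions noted in comments): promptable representations
# from a frozen VLM feeding a small RL policy head.
import torch
import torch.nn as nn


class StubVLM(nn.Module):
    """Hypothetical stand-in for a frozen, pretrained vision-language model.

    A real VLM would encode the image, condition on the prompt, and expose
    internal (e.g., decoder) hidden states; this stub ignores the prompt and
    fabricates a fixed-size embedding so the sketch runs end to end.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.embed_dim = embed_dim
        self.proj = nn.Linear(3 * 64 * 64, embed_dim)  # stand-in "encoder"

    @torch.no_grad()
    def promptable_embedding(self, image: torch.Tensor, prompt: str) -> torch.Tensor:
        # In the real setting, `prompt` (task context + auxiliary information)
        # shapes which semantic features the VLM surfaces in its hidden states.
        flat = image.flatten(start_dim=1)
        return self.proj(flat)


class PolicyHead(nn.Module):
    """Small MLP policy trained with RL on top of the frozen VLM embeddings."""

    def __init__(self, embed_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding)  # action logits


if __name__ == "__main__":
    vlm = StubVLM().eval()                              # frozen representation source
    policy = PolicyHead(vlm.embed_dim, num_actions=8)   # trained by the RL algorithm

    obs = torch.rand(1, 3, 64, 64)                      # one visual observation
    prompt = "Task: chop a tree. Is a tree visible, and where?"  # example task context

    features = vlm.promptable_embedding(obs, prompt)
    logits = policy(features)
    action = torch.distributions.Categorical(logits=logits).sample()
    print("sampled action:", action.item())
```

In this sketch only the policy head receives gradients; the VLM stays frozen and acts purely as a promptable feature extractor, matching the abstract's framing of the VLM as a source of representations rather than an instruction follower.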
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6621