Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Published: 07 Nov 2023, Last Modified: 01 Dec 2023 · FMDM@NeurIPS 2023
Keywords: Reinforcement Learning, Promptable Representations, Vision-language Models, Embodied Agents, Minecraft
TL;DR: We use the prior knowledge in vision-language models for reinforcement learning by extracting promptable representations: grounded semantic features useful for control tasks that are elicited by prompting with task context and auxiliary information.
Abstract: Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. This stands in contrast to most agents trained with reinforcement learning (RL), which typically learn behaviors from scratch. We would therefore like to endow RL agents with a similar ability to leverage contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of \textit{promptable representations}: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex RL tasks in Minecraft. We find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features. Moreover, we show that promptable representations extracted from general-purpose VLMs outperform both domain-specific representations and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important for yielding good representations for RL. Finally, we give a simple method for evaluating and optimizing the prompts used by our approach for a given task without running expensive RL trials, ensuring that it extracts task-relevant semantic features from the VLM.
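The pipeline the abstract describes, a prompted VLM producing an observation embedding that a small policy head maps to actions, can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the `promptable_representation` stub below stands in for a real pretrained VLM query, using a prompt-seeded random projection only to mimic how different prompts elicit different features from the same image; `EMB_DIM`, `N_ACTIONS`, and all function names are assumptions.

```python
import zlib
import numpy as np

EMB_DIM = 16    # assumed embedding size; a real VLM fixes this
N_ACTIONS = 4   # assumed discrete action count for the task

def promptable_representation(image: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for VLM(image, prompt) -> embedding.

    A real system would feed the image and a task-context prompt to a
    pretrained vision-language model and read out internal features.
    Here, the prompt deterministically seeds a random projection, so
    different prompts yield different views of the same observation."""
    seed = zlib.crc32(prompt.encode("utf-8"))
    proj = np.random.default_rng(seed).standard_normal((EMB_DIM, image.size))
    return proj @ image.ravel()

def policy_logits(embedding: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear policy head trained by RL on top of the frozen representation."""
    return weights @ embedding

# Same observation, two prompts -> two different promptable representations.
obs = np.ones((8, 8))
rep_a = promptable_representation(obs, "Is there a tree in view?")
rep_b = promptable_representation(obs, "Is there a spider in view?")
weights = np.zeros((N_ACTIONS, EMB_DIM))  # RL would train these
logits = policy_logits(rep_a, weights)
```

In the paper's setting, only the policy head (and possibly a fine-tuned portion of the VLM) is trained by RL; the point of the sketch is that the prompt is an extra input channel for injecting task context into the representation.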
Submission Number: 66