Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

Hao Liu; Lisa Lee; Kimin Lee; Pieter Abbeel

Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

Hao Liu, Lisa Lee, Kimin Lee, Pieter Abbeel

Published: 01 Feb 2023, Last Modified: 26 May 2025Submitted to ICLR 2023Readers: Everyone

Keywords: reinforcement learning, pre-training, multimodal representation, representation learning, transformer

Abstract: Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lack visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring designing specialized network architecture to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better model scalability and generalization ability than prior work.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)

TL;DR: A simple model consists of a pretrained multimodal transformer and policy transformer for instruction following that significantly improves performance

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 7 code implementations](https://www.catalyzex.com/paper/instruction-following-agents-with-jointly-pre/code)

9 Replies

Loading