Learning to Model the World with Language

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reinforcement learning, language and rl, language grounding, world models, multi-modal world models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose Dynalang, an agent that can leverage diverse language by learning a multimodal world model that predicts future text and image representations.
Abstract: To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents learn to execute simple language instructions, we aim to build agents that leverage diverse language—language like “this button turns on the TV” or “I put the bowls away”—that conveys general knowledge, describes the state of the world, provides interactive feedback, and more. Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future: what they will observe, how the world will behave, and which situations will bring high reward. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations, and learns to act from imagined model rollouts. Unlike current agents that use language to predict actions only, Dynalang acquires a rich language understanding by learning to predict future language, video, and rewards. In addition to learning from online interaction in an environment, we show that Dynalang can be pretrained on text-only datasets, enabling learning from more general, offline datasets. From using language hints in grid worlds to navigating photorealistic home scans, Dynalang can leverage diverse types of language, e.g. environment descriptions, game rules, and instructions.
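To make the objective concrete, below is a minimal illustrative sketch of the idea the abstract describes: a world model that fuses per-timestep text tokens and image observations into a shared latent sequence and is trained to predict the next token, the next image representation, and the reward. All module names, sizes, and the simple GRU dynamics here are hypothetical stand-ins, not the authors' implementation; the actual Dynalang world model is a richer recurrent state-space model.

```python
# A minimal sketch (assumed architecture, not the paper's actual model) of a
# multimodal world model trained to predict future text, image features, and
# reward from a shared latent state, as described in the abstract.

import torch
import torch.nn as nn


class TinyMultimodalWorldModel(nn.Module):
    def __init__(self, vocab_size=256, img_dim=512, latent_dim=256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, latent_dim)  # one text token per timestep
        self.img_encode = nn.Linear(img_dim, latent_dim)         # stand-in for an image encoder
        self.dynamics = nn.GRU(2 * latent_dim, latent_dim, batch_first=True)
        # Heads that make the latent state predictive of the next observation:
        self.token_head = nn.Linear(latent_dim, vocab_size)      # next text token
        self.img_head = nn.Linear(latent_dim, img_dim)           # next image features
        self.reward_head = nn.Linear(latent_dim, 1)              # reward

    def forward(self, tokens, images):
        # tokens: (B, T) int64 token ids; images: (B, T, img_dim) float features
        x = torch.cat([self.token_embed(tokens), self.img_encode(images)], dim=-1)
        latents, _ = self.dynamics(x)                            # (B, T, latent_dim)
        return self.token_head(latents), self.img_head(latents), self.reward_head(latents)


def world_model_loss(model, tokens, images, rewards):
    # Self-supervised objective: the latent at time t must predict the
    # text, image, and reward at time t+1.
    pred_tok, pred_img, pred_rew = model(tokens[:, :-1], images[:, :-1])
    tok_loss = nn.functional.cross_entropy(
        pred_tok.reshape(-1, pred_tok.size(-1)), tokens[:, 1:].reshape(-1))
    img_loss = nn.functional.mse_loss(pred_img, images[:, 1:])
    rew_loss = nn.functional.mse_loss(pred_rew.squeeze(-1), rewards[:, 1:])
    return tok_loss + img_loss + rew_loss
```

Because language enters the model only through this next-step prediction loss, a sentence like "this button turns on the TV" improves the loss exactly when it helps the model anticipate future observations and rewards, which is how the paper unifies language understanding with future prediction.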
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4380