TL;DR: We present a ladder of simple improvements to vision-based model-based RL agents that, taken together, achieve significantly higher reward and score on the challenging Craftax benchmark.
Abstract: We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities---such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 69.66% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs.
We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on both real and imaginary data, (b) a "nearest neighbor tokenizer" on image patches, which improves the scheme for creating the transformer world model (TWM) inputs, and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
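To make improvement (b) concrete, here is a minimal sketch of a nearest-neighbor patch tokenizer: each image patch is mapped to the index of its closest codebook entry, and patches that are far from every existing entry are added as new codes. This is an illustrative reconstruction under stated assumptions, not the paper's implementation; the function name, the Euclidean distance metric, and the threshold `tau` are assumptions for the example.

```python
import numpy as np

def nn_tokenize(patches, codebook, tau=0.5):
    """Illustrative nearest-neighbor tokenizer for image patches.

    Maps each flattened patch to the index of its nearest codebook
    entry (Euclidean distance); patches farther than `tau` from all
    entries are appended to the codebook as new codes, so the
    vocabulary grows as novel patches are encountered.
    """
    tokens = []
    for p in patches:
        if not codebook:
            codebook.append(p)      # first patch seeds the codebook
            tokens.append(0)
            continue
        dists = [np.linalg.norm(p - c) for c in codebook]
        i = int(np.argmin(dists))
        if dists[i] > tau:
            codebook.append(p)      # unseen patch becomes a new code
            i = len(codebook) - 1
        tokens.append(i)
    return tokens

# Toy usage: two distinct patches yield two codes; a repeat reuses one.
cb = []
toks = nn_tokenize([np.zeros(4), np.ones(4), np.zeros(4)], cb)
# toks == [0, 1, 0]; the codebook now holds 2 entries
```

A tokenizer of this form needs no gradient training, which is one appeal of a nearest-neighbor scheme over a learned discrete autoencoder.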
Lay Summary: Learning new skills rapidly from limited data is critical for building general AI systems. As a step towards this goal, we propose a new method for training AI agents to play 2D video games, including Craftax-classic (a version of Crafter) and MinAtar (a simplified version of Atari). Our method is the first to beat human performance on Craftax-classic using limited training data. We achieve this by adding three key enhancements to existing reinforcement learning methods: a better way to preprocess images, a new technique to accurately predict the next frame of the game (as part of the agent's world model), and an improved method for learning the agent's strategy from both real data (from the environment) and "imaginary data" (generated by the agent's world model).
Primary Area: Reinforcement Learning->Online
Keywords: Model Based Reinforcement Learning, Background Planning, Transformer World Model
Submission Number: 5434