Keywords: world models, reinforcement learning
TL;DR: Policy-gradient ESNR predicts downstream policy performance, giving a practical diagnostic for “policy-aware” world models and guiding pretraining, architecture tweaks, and policy choice.
Abstract: World models have received significant attention from the robotics and computer vision communities, both of which have started scaling to networks comprising billions of parameters in the hope of unlocking new robot skills. In this paradigm, models are pre-trained on internet-scale data and then fine-tuned on robot data to learn policies. However, it remains unclear what makes a good world model for downstream policy learning. We show that world model prediction loss is in many instances uncorrelated with policy performance, forcing practitioners to train models to completion for correct evaluation. This results in slow, costly iterations of model training and policy evaluation. In this work, we demonstrate that the expected signal-to-noise ratio (ESNR) of policy gradients provides a reliable training-time metric for downstream policy performance. This provides a handle on the world model's policy awareness, which denotes how well a policy can learn from a model. We show that ESNR can be used to understand (1) when world models are sufficiently pre-trained, (2) how architecture changes affect downstream performance, and (3) which policy learning method is best for a given world model. Crucially, ESNR can be computed on-the-fly with minimal overhead and without a trained policy. We validate our metric on traditional architectures and tasks as well as large pretrained world models, demonstrating the practical utility of ESNR for practitioners who wish to train or finetune such models for robot applications. Visualizations and code are available here: https://policy-aware.github.io/paper-anon.
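The abstract does not spell out how ESNR is computed; a minimal sketch of one common gradient signal-to-noise definition, ||E[g]||² / tr(Cov[g]) estimated from a batch of stochastic policy-gradient samples, might look like the following (the function name `esnr` and the exact formula are assumptions, not the paper's stated method):

```python
import numpy as np

def esnr(grads):
    """Signal-to-noise ratio of stochastic gradient estimates.

    grads: array of shape (n_samples, n_params), each row one
    policy-gradient estimate. Assumed definition (not given in the
    abstract): squared norm of the mean gradient divided by the
    trace of the sample covariance.
    """
    grads = np.asarray(grads, dtype=float)
    mean = grads.mean(axis=0)
    var = grads.var(axis=0, ddof=1).sum()  # trace of the covariance
    return float(mean @ mean) / max(var, 1e-12)

# Toy check: low-noise gradient estimates should score a higher
# ESNR than high-noise estimates of the same true gradient.
rng = np.random.default_rng(0)
true_g = np.array([1.0, -2.0, 0.5])
low_noise = true_g + 0.01 * rng.standard_normal((64, 3))
high_noise = true_g + 1.0 * rng.standard_normal((64, 3))
```

Under this reading, the metric needs only gradient samples from the current (even untrained) policy, which is consistent with the claim that it can be computed on-the-fly without a trained policy.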
Supplementary Material: pdf
Primary Area: reinforcement learning
Submission Number: 2644