Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We prove a flaw in the popular VAML family of algorithms, propose a fix, and validate it empirically.
Abstract: The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes the discrepancy between a model's value prediction and the ground-truth value function, has been used in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, meaning that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between loss calibration, latent model architectures, and the auxiliary losses commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
Lay Summary: This paper analyzes Value-Aware Model Learning (VAML), including the MuZero loss, in model-based reinforcement learning. VAML-based losses train a model to predict accurate estimates of the value of taking an action in each state, instead of training it to predict the states themselves accurately. The paper identifies that current sample-based VAML losses are "uncalibrated" when used with stochastic environment models: they learn solutions that are slightly flawed and produce suboptimal results when used for learning and planning. This error stems from the model's predictions becoming too confident (reducing the variance of the prediction) and too smooth (assigning similar values to states that may not warrant them). The paper proposes "Corrected VAML" (CVAML), which adds a variance correction term, and formally proves that this enables more accurate value function and model recovery. In addition, the authors provide empirical evidence suggesting that calibrated stochastic models can be advantageous over uncalibrated or deterministic models in difficult control tasks.
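To make the calibration issue concrete, here is a minimal, hypothetical sketch of a sample-based value-aware loss and a variance-corrected variant. It assumes the correction amounts to removing the model-induced variance of the value prediction (e.g., via two independent model samples); the function names, interfaces, and the exact form of the CVAML loss are illustrative assumptions, not the authors' implementation — see the linked repository for that.

```python
# Illustrative sketch only. `model.sample` and `value_fn` are hypothetical
# interfaces: the model draws a next state given (state, action), and
# value_fn maps a state to a scalar value estimate (e.g., torch tensors).
import torch


def uncalibrated_vaml_loss(model, value_fn, state, action, target_value):
    """Sample-based VAML/MuZero-style loss with a single model sample.

    Its expectation decomposes as
        E_model[(V(s') - y)^2] = (E_model[V(s')] - y)^2 + Var_model[V(s')],
    so minimizing it also penalizes the variance term, pushing a
    stochastic model toward overconfident (low-variance) predictions.
    """
    next_state = model.sample(state, action)
    return (value_fn(next_state) - target_value) ** 2


def corrected_vaml_loss(model, value_fn, state, action, target_value):
    """Variance-corrected variant using two independent model samples.

    E[(V(s1) - y) * (V(s2) - y)] = (E_model[V(s')] - y)^2 for independent
    samples s1, s2, so the spurious variance penalty is removed and only
    the mean value prediction is matched to the target.
    """
    s1 = model.sample(state, action)
    s2 = model.sample(state, action)  # independent second sample
    return (value_fn(s1) - target_value) * (value_fn(s2) - target_value)
```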
Link To Code: https://github.com/adaptive-agents-lab/CVAML
Primary Area: Reinforcement Learning
Keywords: model-based rl, muzero, itervaml, theory, calibration, latent models
Submission Number: 6896