Keywords: MBRL, Planning, sample efficiency, dreamer, world models, information gain, uncertainty, entropy, optimism
TL;DR: We use the world models already trained by MBRL methods to estimate uncertainty at inference time and bias environment sampling towards uncertain states, improving both sample efficiency and final performance.
Abstract: Model-based reinforcement learning (MBRL) offers an intuitive way to improve the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. These world models account for the large majority of training compute and time, and they are used to train actors entirely in imagination, yet once training is done they are quickly discarded. We show in this work that utilising these models at inference time can boost not only performance but also sample efficiency. We propose a novel approach that anticipates and actively seeks out high-entropy states using the world model's short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods, which chase once-novel states well after they were first stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, the planning horizon length, and the degree of commitment to entropy-seeking. While our method can in principle be applied to any model that trains its actors solely on model-generated data, we apply it to Dreamer to illustrate the concept. Our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer at convergence, and in only 60% of the environment steps that base Dreamer's policy needs; it displays reasoned exploratory behaviour in Crafter and achieves the same reward as base Dreamer in a third of the steps; planning is also shown to accelerate and improve performance on DeepMind Control.
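The core idea of the abstract (scoring candidate action sequences by the entropy of the world model's short-horizon latent predictions and committing to the highest-entropy plan) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `step_fn` interface, the diagonal-Gaussian latent parameterisation, and the random-shooting candidate search are all assumptions made for the example.

```python
# Hypothetical sketch of entropy-seeking short-horizon planning with a
# learned world model. `step_fn(z, a) -> (mean, std)` is an assumed
# interface for the model's one-step latent prediction.
import numpy as np

rng = np.random.default_rng(0)

def rollout_entropy(step_fn, z0, actions):
    """Sum diagonal-Gaussian latent entropies along an imagined rollout."""
    z, total = z0, 0.0
    for a in actions:
        mean, std = step_fn(z, a)
        # Differential entropy of a diagonal Gaussian:
        # 0.5 * log(2*pi*e*std^2), summed over latent dimensions.
        total += 0.5 * np.sum(np.log(2 * np.pi * np.e * std**2))
        z = mean  # follow the mean prediction to the next step
    return total

def plan(step_fn, z0, action_dim, horizon=5, n_candidates=64):
    """Random-shooting search: keep the candidate action sequence whose
    imagined short-horizon rollout has the highest predictive entropy."""
    best_seq, best_h = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        h = rollout_entropy(step_fn, z0, seq)
        if h > best_h:
            best_seq, best_h = seq, h
    return best_seq, best_h
```

A hierarchical planner as described in the abstract would additionally decide, at each environment step, whether to commit to the remainder of `best_seq` or to replan; the sketch above only shows the inner entropy-scoring loop.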
Primary Area: reinforcement learning
Submission Number: 19814