DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: reinforcement learning, test-time training, test-time reinforcement learning, sparse-reward reinforcement learning, goal selection, goal-conditioned reinforcement learning, exploration, exploration-exploitation, upper confidence bound
TL;DR: We develop DISCOVER, which enables RL agents to solve substantially more challenging tasks than is possible with previous exploration strategies in RL.
Abstract: Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL's core premise, requiring efficient exploration coupled with long-horizon credit assignment, and overcoming these challenges is key for building self-improving agents with superhuman ability. We argue that solving complex and high-dimensional tasks requires solving simpler tasks that are *relevant* to the target task. In contrast, most prior work designs strategies for selecting exploratory tasks with the objective of solving *any* task, making exploration of challenging high-dimensional, long-horizon tasks intractable. We find that the sense of direction necessary for effective exploration can be extracted from existing reinforcement learning algorithms, without any prior information. Based on this finding, we propose a method for _**di**rected **s**parse-reward goal-**co**nditioned **ve**ry long-horizon **R**L_ (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent's initial distance to the target, but independent of the volume of the space of all tasks. Empirically, we perform a thorough evaluation in high-dimensional simulated environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.
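To make the directed goal selection concrete, here is a minimal sketch, assuming a Euclidean goal space, a learned goal-conditioned value function `value_fn`, and per-goal visit counts. The scoring rule (negative distance to the target plus an estimated achievability term plus a UCB-style optimism bonus) is a hypothetical stand-in for illustration, not the paper's exact criterion.

```python
import numpy as np

def select_goal(candidate_goals, target, value_fn, visit_counts, beta=1.0):
    """Pick the candidate goal that trades off direction toward the target
    task against achievability and a UCB-style exploration bonus.

    candidate_goals: (N, d) array of candidate goal states
    target:          (d,) goal of the target task
    value_fn:        callable estimating how reachable a goal currently is
    visit_counts:    (N,) number of times each candidate has been attempted
    """
    # Directedness: prefer goals closer to the target task. (A simple
    # Euclidean proxy; the paper instead extracts a sense of direction
    # from the agent's own learned quantities.)
    direction_score = -np.linalg.norm(candidate_goals - target, axis=1)

    # Achievability: prefer goals the current policy can plausibly reach.
    achievability = np.array([value_fn(g) for g in candidate_goals])

    # UCB-style optimism: the bonus shrinks as a goal is attempted more.
    bonus = beta * np.sqrt(1.0 / (1.0 + visit_counts))

    # The three terms are summed without rescaling here purely for brevity.
    return candidate_goals[np.argmax(direction_score + achievability + bonus)]
```

Selecting the argmax of an optimism-augmented score is what links this style of goal selection to UCB exploration in bandits: each candidate goal plays the role of an arm, and the bonus ensures rarely attempted goals in the direction of the target are eventually tried.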
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 26983