Keywords: Hierarchical Reinforcement Learning, Combinatorial Optimization
TL;DR: GoalZero is a model-based HRL framework that learns a multi-timescale SMDP world model for MuZero-style planning on SSCO tasks.
Abstract: Sequential Stochastic Combinatorial Optimization (SSCO) problems are challenging for reinforcement learning due to exponentially large action spaces, stochastic dynamics, and the need for long-horizon planning under limited resources. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but the high-level policy operates in a Semi-Markov Decision Process (SMDP) in which actions have variable durations; this variability complicates learning a planning-ready world model. We introduce GoalZero, a model-based HRL framework that directly addresses this challenge. GoalZero equips the high-level policy with a MuZero-style planner that learns a world model of the SMDP dynamics. At its core is a principled framework for multi-timescale SMDP (MTS-SMDP) world-model learning: through complementary objectives, the agent learns dynamics in which the magnitude of a latent transition correlates with the temporal scale of the corresponding subgoal, enabling planning over diverse, adaptive temporal abstractions in our evaluated settings. In addition, we propose a subgoal-conditioned budget-allocation mechanism, learned jointly with the multi-timescale world model, that supports context-aware resource management. We demonstrate that GoalZero outperforms strong baselines on challenging SSCO benchmarks.
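A minimal sketch (not from the submission) of how the multi-timescale consistency idea described above might be expressed as a training objective, assuming a PyTorch latent world model. The function name, the log-duration target, and all tensor names here are illustrative assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def mts_consistency_loss(z_t: torch.Tensor,
                         z_next: torch.Tensor,
                         tau: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical multi-timescale consistency term.

    Encourages the latent transition magnitude ||z_next - z_t|| to grow
    with the duration tau of the executed subgoal, so that longer-horizon
    subgoals correspond to larger steps in latent space. The log1p target
    is an illustrative choice, not the paper's stated objective.
    """
    step = torch.linalg.vector_norm(z_next - z_t, dim=-1)  # latent transition magnitude
    target = alpha * torch.log1p(tau.float())              # monotone in subgoal duration
    return F.mse_loss(step, target)

# Usage on dummy data: a batch of 32 latent states of dimension 64.
z_t = torch.randn(32, 64)            # latents before subgoal execution
z_next = torch.randn(32, 64)         # latents after subgoal execution
tau = torch.randint(1, 20, (32,))    # subgoal durations in primitive steps
loss = mts_consistency_loss(z_t, z_next, tau)
```

In such a setup this term would be added to the usual MuZero-style reward, value, and policy losses, tying latent step size to subgoal timescale during joint training.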
Primary Area: reinforcement learning
Submission Number: 18977