Keywords: Hierarchical Reinforcement Learning, Combinatorial Optimization
TL;DR: GoalZero is a model-based HRL framework that learns a multi-timescale SMDP world model for MuZero-style planning on SSCO tasks.
Abstract: Sequential Stochastic Combinatorial Optimization (SSCO) problems are challenging for reinforcement learning due to exponentially large action spaces, stochastic dynamics, and the need for long-horizon planning under limited resources. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but the high-level policy operates in a Semi-Markov Decision Process (SMDP) whose actions have variable durations. This variability complicates learning a planning-ready world model. We introduce **GoalZero**, a model-based HRL framework that directly addresses this challenge. GoalZero integrates a MuZero-style planner at the high level that learns a world model of SMDP dynamics. At its core is a principled framework for **multi-timescale SMDP** (MTS-SMDP) world-model learning. Through complementary objectives, the agent learns dynamics in which the **latent transition magnitude** correlates with the temporal scale of the corresponding subgoal, enabling planning over diverse, adaptive temporal abstractions in our evaluated settings. In addition, we propose a subgoal-conditioned budget allocation mechanism, learned jointly with the multi-timescale world model, that facilitates context-aware resource management. We demonstrate that GoalZero outperforms strong baselines on challenging SSCO benchmarks.
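To make the multi-timescale idea concrete, one possible reading of "latent transition magnitude correlates with the subgoal's temporal scale" is an auxiliary loss that pulls the norm of the latent displacement toward a monotone function of the subgoal's duration. The sketch below is a hypothetical illustration, not the paper's actual objective: the function name, the log-scale target, and the squared-error form are all assumptions.

```python
import numpy as np

def timescale_alignment_loss(z_t, z_next, durations, alpha=1.0):
    """Toy auxiliary loss (illustrative only, not GoalZero's formulation).

    Encourages the latent transition magnitude ||z_next - z_t|| to grow
    with the (log) duration of the executed subgoal, so that long-horizon
    subgoals correspond to large latent displacements and short ones to
    small displacements.

    z_t, z_next : (batch, dim) latent states before/after a subgoal.
    durations   : (batch,) number of primitive steps the subgoal took.
    alpha       : scale hyperparameter mapping log-duration to magnitude.
    """
    magnitudes = np.linalg.norm(z_next - z_t, axis=-1)
    targets = alpha * np.log1p(durations)  # log scale: diminishing growth
    return float(np.mean((magnitudes - targets) ** 2))
```

In practice such a term would be one of the "complementary objectives" trained alongside the usual reconstruction/value/policy losses, so that MCTS rollouts in latent space implicitly mix temporal scales.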
Primary Area: reinforcement learning
Submission Number: 18977