Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Offline Reinforcement Learning, Hierarchical Reinforcement Learning
Abstract: Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. Hierarchical approaches attempt to address this by decomposing tasks into high-level subgoals, but most existing methods rely on two-level architectures with separate networks, which introduces two fundamental limitations: they generate only a single intermediate subgoal, so an erroneous subgoal can mislead the low-level policy into acting without awareness of the final goal, and their separate training objectives preclude end-to-end optimization. We find a novel solution to these challenges in the chain-of-thought reasoning of large language models. Building on this insight, we introduce the Chain-of-Goals Hierarchical Policy (CoGHP), a framework that reformulates hierarchical control as autoregressive sequence generation within a single unified architecture. CoGHP generates a sequence of latent subgoals and the primitive action in a single forward pass, where each subgoal acts as a "reasoning token" encoding intermediate decision-making. To implement this chain-of-thought approach in hierarchical RL, we pioneer the use of the MLP-Mixer architecture, whose simple feedforward operations enable efficient cross-token communication and capture the consistent structural relationships essential for hierarchical reasoning. Experimental results on challenging navigation and manipulation benchmarks show that CoGHP consistently outperforms strong baselines, demonstrating its effectiveness for long-horizon offline control tasks.
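To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of how a chain of latent subgoal tokens followed by a primitive action could be produced with MLP-Mixer-style token mixing. Everything here is an illustrative assumption rather than the authors' implementation: the class names (`MixerBlock`, `ChainOfGoalsPolicy`), the number of subgoal tokens, the layer dimensions, and the iterative decoding loop are choices made for the sketch; the paper may realize the autoregressive generation differently (e.g., in one masked pass).

```python
# Illustrative sketch only; not the authors' code. Assumes a fixed-length
# token chain [state, goal, z_1..z_K] where each z_i is a latent subgoal
# and the action is read from the mixed representation of the final token.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    """Standard MLP-Mixer block: token-mixing then channel-mixing MLPs."""

    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.token_norm = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        self.chan_norm = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Mix information across tokens (simple feedforward cross-token communication).
        y = self.token_norm(x).transpose(1, 2)       # (batch, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)    # residual token mixing
        # Mix channels within each token.
        x = x + self.chan_mlp(self.chan_norm(x))     # residual channel mixing
        return x


class ChainOfGoalsPolicy(nn.Module):
    """Autoregressively emits K latent subgoal tokens, then a primitive action."""

    def __init__(self, obs_dim, goal_dim, act_dim, num_subgoals=3, dim=128, depth=4):
        super().__init__()
        self.num_subgoals = num_subgoals
        self.num_tokens = 2 + num_subgoals           # [state, goal, z_1..z_K]
        self.embed_obs = nn.Linear(obs_dim, dim)
        self.embed_goal = nn.Linear(goal_dim, dim)
        self.mixer = nn.Sequential(
            *[MixerBlock(self.num_tokens, dim) for _ in range(depth)]
        )
        self.subgoal_head = nn.Linear(dim, dim)      # next latent subgoal token
        self.action_head = nn.Linear(dim, act_dim)   # primitive action

    def forward(self, obs, goal):
        tokens = [self.embed_obs(obs), self.embed_goal(goal)]
        # Generate each latent subgoal conditioned on the chain so far; because
        # the goal stays in the token sequence, every subgoal (and the action)
        # remains aware of the final goal through token mixing.
        for _ in range(self.num_subgoals):
            pad = [torch.zeros_like(tokens[0])] * (self.num_tokens - len(tokens))
            h = self.mixer(torch.stack(tokens + pad, dim=1))
            tokens.append(self.subgoal_head(h[:, len(tokens) - 1]))
        # Final pass over the completed chain yields the primitive action.
        h = self.mixer(torch.stack(tokens, dim=1))
        return self.action_head(h[:, -1])
```

As a usage example under the same assumptions, `ChainOfGoalsPolicy(obs_dim=29, goal_dim=2, act_dim=8)(obs, goal)` maps a batch of observations and goals to a batch of actions in one call, with the latent subgoals produced as intermediate "reasoning tokens" along the way.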
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15898