Keywords: Hierarchical Reinforcement Learning, Reinforcement Learning Theory, Suboptimality, Semi-Markov Process
TL;DR: In HRL, the calling modes of the target policy, behavior policy, and deployment policy should be different.
Abstract: Hierarchical Reinforcement Learning (HRL) achieves highly efficient exploration in long-horizon decision-making problems with sparse rewards via the Semi-Markov Decision Process (SMDP). However, we observe a structural limitation of the SMDP in HRL: once a subtask is called, the agent is locked into a fixed course of action and loses the flexibility to switch to other higher-value subtasks, which is a critical barrier to optimality. To address this issue, we first decompose this suboptimality into execution suboptimality and policy suboptimality, and then propose corresponding algorithmic improvement frameworks. On the theoretical side, we reveal a fundamental design flaw in HRL in which the SMDP formulation is adopted simultaneously for both the target and behavior policies. To overcome this flaw, we introduce the concepts of the task tree and the execution tree to decouple them, reducing the problem to a tradeoff between exploration and exploitation over policy execution modes. By constructing a unified value function and a generalized hierarchical Bellman equation, we obtain a multi-level value formalization. Building on this, we further propose the Hierarchical Policy Improvement Theorem and the Optimal Execution Theorem. These results theoretically prove the existence of the two types of suboptimality and provide guarantees for the proposed improvement frameworks.
Controlled experiments across diverse environments consistently validate both the correctness of our theory and the effectiveness of the proposed improvements.
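For background, the commitment problem the abstract describes can be seen in the classic SMDP (options-style) Bellman equation below; this is a standard textbook form given for context, not the paper's generalized hierarchical Bellman equation, and the symbols ($\Omega$, $\omega$, $\tau$) are assumptions for illustration:
$$
Q_{\Omega}(s,\omega) \;=\; \mathbb{E}\!\left[\;\sum_{t=0}^{\tau-1}\gamma^{t} r_{t} \;+\; \gamma^{\tau}\,\max_{\omega'\in\Omega} Q_{\Omega}(s_{\tau},\omega') \;\middle|\; s_{0}=s,\ \omega \right],
$$
where $\omega$ is a subtask (option) and $\tau$ is its random termination time. Because the maximization over $\omega'$ is applied only at the termination state $s_{\tau}$, the agent cannot re-evaluate alternative subtasks mid-execution, which is precisely the "locked-in" behavior the abstract identifies as a source of suboptimality.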
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 11983