HDCS: Hierarchy Discovery and Critic Shaping for Reinforcement Learning with Automaton Specification

TMLR Paper4735 Authors

27 Apr 2025 (modified: 29 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Training reinforcement learning (RL) agents from scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Deterministic finite-state automata (DFAs) provide a streamlined way to specify RL tasks that surpasses the limitations of traditional discounted-return formulations. However, existing RL algorithms designed for DFA tasks face unresolved challenges that hinder their practical application. One key issue is that subgoals in a DFA may exhibit hidden hierarchical structure, with some macro-subgoals comprising multiple micro-subgoals in particular orders. Without understanding this hierarchy, RL algorithms may struggle to solve tasks involving such macro-subgoals efficiently. Additionally, the sparse-reward problem remains inadequately addressed: previous approaches, such as potential-based reward shaping, are often inefficient or yield suboptimal solutions. To address these challenges, we propose HDCS, a novel RL framework that uncovers the hierarchical structure of subgoals and accelerates the solving of DFA tasks without changing the original optimal policies. The framework operates in two phases. First, a hierarchical RL method identifies the prerequisites of subgoals and builds the hierarchy. Second, given any task specification (DFA), the subgoal hierarchy is incorporated into the task DFA to form a product DFA, and a simple, novel critic-shaping approach is used to accelerate the satisfaction of the product DFA without altering the optimal policies of the original problem. The effectiveness of HDCS is demonstrated through experiments across various domains; in particular, compared with representative baselines, critic shaping yields 2x to 3x acceleration in task solving.
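The product-DFA step mentioned above can be illustrated with a standard synchronous product construction. The sketch below is a minimal, self-contained illustration under assumed conventions: each DFA is a plain dict with `states`, `alphabet`, `init`, `accepting`, and a total transition map `delta`, and the product accepts when both components accept. These data structures and names are illustrative assumptions, not the paper's actual implementation of incorporating the subgoal hierarchy.

```python
from itertools import product


def dfa_product(dfa_a, dfa_b):
    """Synchronous product of two DFAs over a shared alphabet.

    Each DFA is a dict with keys 'states', 'alphabet', 'init',
    'accepting', and 'delta' (a dict mapping (state, symbol) -> state).
    The product accepts exactly when both component DFAs accept.
    """
    states = set(product(dfa_a["states"], dfa_b["states"]))
    delta = {}
    for (qa, qb) in states:
        for sym in dfa_a["alphabet"]:
            # Both components advance on the same symbol.
            delta[((qa, qb), sym)] = (dfa_a["delta"][(qa, sym)],
                                      dfa_b["delta"][(qb, sym)])
    return {
        "states": states,
        "alphabet": set(dfa_a["alphabet"]),
        "init": (dfa_a["init"], dfa_b["init"]),
        "accepting": {(qa, qb) for (qa, qb) in states
                      if qa in dfa_a["accepting"] and qb in dfa_b["accepting"]},
        "delta": delta,
    }


def run(dfa, word):
    """Return True if the DFA accepts the given symbol sequence."""
    q = dfa["init"]
    for sym in word:
        q = dfa["delta"][(q, sym)]
    return q in dfa["accepting"]


# Two toy subgoal DFAs over {'a', 'b'}: one is satisfied once 'a' has
# been observed, the other once 'b' has been observed.
dfa_a = {"states": {0, 1}, "alphabet": {"a", "b"}, "init": 0,
         "accepting": {1},
         "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}}
dfa_b = {"states": {0, 1}, "alphabet": {"a", "b"}, "init": 0,
         "accepting": {1},
         "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}}

prod = dfa_product(dfa_a, dfa_b)
```

The product DFA is satisfied only by sequences achieving both subgoals, so a shaping signal defined on its states can reward partial progress without changing which policies are optimal for the original task.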
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=3EnqE6HEzW&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: Changed the format.
Assigned Action Editor: ~Stephen_James1
Submission Number: 4735
