Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications
Keywords: Reward Design, Human-Centered AI, Hierarchical RL, HAI
TL;DR: Hierarchical rewards better encode human preferences regarding agent behavior.
Abstract: When training AI agents to perform tasks, humans often care not only about $\textit{whether}$ a task is completed but also $\textit{how}$ it is performed. As agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. $\textit{Reward design}$ provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture the nuanced human preferences that arise in long-horizon tasks. Hence, we introduce $\textbf{Hierarchical Reward Design from Language (HRDL)}$: a problem formulation that extends classical reward design to encode richer behavioral specifications for Hierarchical RL agents. We further propose $\textbf{Language to Hierarchical Rewards (L2HR)}$, our solution to HRDL. Human-subject and numerical experiments show that Hierarchical RL agents trained with rewards designed via L2HR not only complete tasks effectively but also adhere more closely to human specifications. Together, HRDL and L2HR advance research on human-aligned AI agents.
Area: Human-Agent Interaction (HAI)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 1486