TL;DR: We develop a new reward design framework, HERON, for reinforcement learning problems where the feedback signals have hierarchical structure.
Abstract: Reward design is a fundamental, yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward design framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. In both scenarios, we design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With the resulting preference data, we then train a reward model for policy learning. We apply HERON to several RL applications and find that our framework not only trains high-performing agents on a variety of difficult tasks, but also provides additional benefits such as improved sample efficiency and robustness.
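The abstract describes comparing trajectories with a decision tree induced by the importance ranking of the feedback signals. Below is a minimal sketch of such a hierarchical comparison, not the authors' implementation: the function name, the per-signal margins, and the tie-handling rule are illustrative assumptions made for this example.

```python
# Sketch of HERON-style hierarchical preference labeling (illustrative only).
# Signals are ordered from most to least important; the first signal whose
# difference exceeds its margin decides which trajectory is preferred.
from typing import Optional, Sequence


def heron_preference(
    feedback_a: Sequence[float],
    feedback_b: Sequence[float],
    margins: Sequence[float],
) -> Optional[int]:
    """Return 0 if trajectory A is preferred, 1 if B is, None on a tie."""
    for a, b, m in zip(feedback_a, feedback_b, margins):
        if abs(a - b) > m:
            return 0 if a > b else 1
    return None  # no signal was decisive


if __name__ == "__main__":
    # Trajectory A scores higher on the most important signal, so it wins
    # even though trajectory B scores higher on a less important one.
    a = [0.9, -2.0]   # e.g., [tests_passed, negative_cost]
    b = [0.5, -1.0]
    print(heron_preference(a, b, margins=[0.1, 0.1]))  # -> 0
```

Preference labels produced this way would then be used to fit a reward model (for example with a pairwise preference loss), which in turn provides the reward signal for policy learning, as outlined in the abstract.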
Lay Summary: When training AI agents, we typically need to construct a reward function, which tells the agent whether it is doing a task well or poorly. Such reward functions are usually built by combining several feedback signals, such as correctness, cost, and safety, into a final reward. However, current methods for designing reward functions are often inflexible and tedious. To ease the reward design process, we propose HERON, which constructs the reward from a hierarchical relationship between the different feedback signals. HERON allows us to design flexible reward functions with little effort. Experimental results show that HERON can train high-performing agents on a wide variety of tasks.
Link To Code: https://github.com/abukharin3/HERON
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reinforcement Learning, Reward Modeling, Code Generation
Submission Number: 11228