Policy Optimization using Horizon Regularized Advantage to Improve Generalization in Reinforcement Learning

Published: 01 Jan 2024 · Last Modified: 25 May 2024 · AAMAS 2024 · CC BY-SA 4.0
Abstract: In this work, we focus on improving the generalization performance of a reinforcement learning (RL) agent across diverse environments. We observe that in environments created under the Contextual Markov Decision Process (CMDP) framework, where an environment's dynamics and attribute distribution change across contexts, the generated episodes are highly stochastic and unpredictable. To improve generalization in such scenarios, we present Horizon Regularized Advantage (HRA) estimation, which provides robustness to the underlying uncertainty in episode duration. On three challenging RL generalization benchmarks (Procgen, Crafter, and Minigrid), we demonstrate that our proposed approach outperforms the Proximal Policy Optimization (PPO) baseline, which uses the classical advantage estimate based on a single exponential discount. We also incorporate HRA into another generalization-specific approach (APDAC), and the results indicate a further improvement in APDAC's generalization ability. This demonstrates the effectiveness of our approach as a generic component that can be incorporated into any policy gradient method to aid generalization.
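For reference, the PPO baseline mentioned above relies on the classical single exponential discounting-based advantage estimate, i.e., Generalized Advantage Estimation (GAE). The following is a minimal sketch of that baseline estimator only, not of the paper's HRA method; the function name `gae_advantages` and the array layout are illustrative assumptions, and the exact horizon-regularized formulation is given in the paper itself.

```python
# Sketch of the classical GAE baseline (single exponential discount gamma,
# smoothing parameter lambda) that PPO uses; HRA modifies this estimate.
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with a single exponential discount.

    rewards, dones: arrays of length T collected from one rollout.
    values: array of length T + 1 (bootstrap value appended at the end).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]          # zero out bootstrap at episode end
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    return advantages
```

Under highly variable episode lengths, as in CMDP-generated environments, the single discount factor above commits to one effective horizon; HRA is proposed as a drop-in replacement for this estimate inside any policy gradient method.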