Policy Optimization with Augmented Value Targets for Generalization in Reinforcement Learning

Published: 01 Jan 2023 · Last Modified: 25 May 2024 · IJCNN 2023 · CC BY-SA 4.0
Abstract: Our work aims to improve the generalization of a reinforcement learning (RL) agent to unseen environment variations. The value function used in RL agents frequently overfits, leading to poor generalization performance. In this work, we argue that task completion time is strongly affected by varying environmental conditions, resulting in variation in episode lengths and, consequently, in value estimation. When learning from only a limited set of environment variations, the agent's value estimates therefore become biased toward the observed episode lengths. To address this, we introduce Augmented Value Targets (AVaTar), which generates multiple value function targets that account for possible episode-length variation and optimizes the value function against the average of these targets. We demonstrate that optimizing the average of the augmented targets is computationally more feasible than leveraging each pseudo-target independently. Evaluations on the Procgen and Crafter benchmarks show that our proposed approach is effective in generalizing value estimates over unseen contexts and significantly outperforms the standard policy gradient algorithm Proximal Policy Optimization (PPO). Furthermore, comparison and integration with the recent generalization-specific approach UCB-DrAC indicate that AVaTar outperforms UCB-DrAC in most of the Procgen environments.
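The abstract does not spell out how the augmented targets are constructed. Below is a minimal sketch of one plausible reading: perturb the effective episode length, compute a discounted-return target under each perturbation, and average the results into a single regression target per timestep. The function names, the truncation/padding scheme, and the `length_offsets` parameter are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def discounted_return(rewards, gamma=0.999):
    """Standard Monte Carlo return G_t for every timestep of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def augmented_value_targets(rewards, gamma=0.999, length_offsets=(-10, 0, 10)):
    """Hypothetical sketch: build one return target per assumed episode-length
    offset (truncating or padding the reward sequence), then average them into
    a single value-function target per timestep."""
    horizon = len(rewards)
    targets = []
    for offset in length_offsets:
        new_len = max(1, horizon + offset)
        if new_len <= horizon:
            rew = rewards[:new_len]                       # assume earlier termination
        else:
            rew = list(rewards) + [rewards[-1]] * (new_len - horizon)  # assume later termination
        target = discounted_return(np.asarray(rew), gamma)[:horizon]
        # Pad truncated targets back to the original horizon with their last value.
        target = np.pad(target, (0, horizon - len(target)), mode="edge")
        targets.append(target)
    return np.mean(targets, axis=0)  # single averaged target per timestep
```

Averaging the pseudo-targets before regression means the value function is fit against one target vector rather than against each augmented target separately, which is consistent with the abstract's claim that optimizing the average is computationally more feasible than leveraging the pseudo-targets independently.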