Rethinking Value Function Learning for Generalization in Reinforcement Learning

Seungyong Moon; JunYeong Lee; Hyun Oh Song

Published: 31 Oct 2022, Last Modified: 06 Apr 2025, NeurIPS 2022 Accept, Readers: Everyone

Keywords: RL generalization

Abstract: Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is more challenging to optimize and more prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization on the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently, but with more training data, than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
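
The update schedule described in the abstract (value network optimized less often, but on more data, than the policy network) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `delayed_critic_schedule`, the `value_update_period` parameter, and the use of integer rollout IDs in place of real trajectories are all assumptions made for illustration, and no actual gradient steps are performed.

```python
from collections import deque


def delayed_critic_schedule(num_iterations, value_update_period):
    """Log which data each network would train on at every iteration.

    Hypothetical schedule (an assumption, not taken from the paper):
    - the policy is updated every iteration on the newest rollout only;
    - the value network is updated once every `value_update_period`
      iterations, on all rollouts collected since its last update.
    """
    # Holds the rollouts collected since the last value update.
    buffer = deque(maxlen=value_update_period)
    log = []
    for iteration in range(1, num_iterations + 1):
        rollout_id = iteration  # stand-in for freshly collected trajectories
        buffer.append(rollout_id)

        entry = {
            "iteration": iteration,
            "policy_data": [rollout_id],  # policy sees only the newest rollout
            "value_data": None,           # None means no value update this step
        }
        if iteration % value_update_period == 0:
            # Delayed value update on the larger, accumulated batch.
            entry["value_data"] = list(buffer)
            buffer.clear()
        log.append(entry)
    return log


schedule = delayed_critic_schedule(num_iterations=6, value_update_period=3)
for entry in schedule:
    print(entry)
```

With these hypothetical settings the policy is updated six times on one rollout each, while the value network is updated twice on three rollouts each, which is the "less frequently with more training data" pattern the abstract describes.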

TL;DR: We investigate the difficulty of learning a value network on multiple training environments and propose a simple policy gradient algorithm that improves observational generalization and sample efficiency on the Procgen Benchmark.

Supplementary Material: pdf

Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/rethinking-value-function-learning-for/code)
