Attention-based Partial Decoupling of Policy and Value for Generalization in Reinforcement Learning

12 Oct 2021, 19:37 (modified: 02 Dec 2021, 06:09) | Deep RL Workshop NeurIPS 2021 | Readers: Everyone
Keywords: Generalization, Reinforcement Learning, meta-learning, Attention, Policy Optimization, Procgen
TL;DR: APDAC partially decouples policy and value function optimization and adds an attention mechanism to improve generalization in reinforcement learning.
Abstract: In this work, we introduce Attention-based Partially Decoupled Actor-Critic (APDAC), an actor-critic architecture for generalization in reinforcement learning that partially separates the policy and the value function. To learn directly from images, traditional actor-critic architectures use a shared network to represent both the policy and the value function. While a shared representation allows parameter and feature sharing, it can also lead to overfitting that catastrophically hurts generalization performance. On the other hand, using two separate networks for policy and value can help avoid overfitting and reduce the generalization gap, but at the cost of added complexity in both architecture design and hyperparameter tuning. APDAC provides an intermediate tradeoff that combines the strengths of both architectures by sharing the initial part of the network and separating the later parts for policy and value. It also incorporates an attention mechanism to propagate relevant features to the separate policy and value blocks. Our empirical analysis shows that APDAC significantly outperforms the PPO baseline and achieves performance comparable to the recent state-of-the-art method IDAAC on the challenging RL generalization benchmark Procgen. Our code is available at \url{}.
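The partial decoupling described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the abstract does not specify the form of the attention mechanism, so the sketch assumes a simple per-branch sigmoid gate over the shared features, and it uses dense layers on a flat observation vector instead of an image encoder. The class name, layer names, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_linear(n_in, n_out):
    """Return (weights, bias) for a dense layer with small random weights."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def linear(x, params):
    w, b = params
    return x @ w + b


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class PartiallyDecoupledActorCritic:
    """Hypothetical sketch: shared trunk, then separate policy and value branches."""

    def __init__(self, obs_dim, feat_dim, hidden_dim, n_actions):
        # Shared initial part of the network.
        self.shared = init_linear(obs_dim, feat_dim)
        # ASSUMPTION: attention modeled as one sigmoid gate per branch.
        self.policy_gate = init_linear(feat_dim, feat_dim)
        self.value_gate = init_linear(feat_dim, feat_dim)
        # Separate later parts for policy and value.
        self.policy_hidden = init_linear(feat_dim, hidden_dim)
        self.value_hidden = init_linear(feat_dim, hidden_dim)
        self.policy_out = init_linear(hidden_dim, n_actions)
        self.value_out = init_linear(hidden_dim, 1)

    def forward(self, obs):
        # Features computed once and shared by both branches.
        feats = np.maximum(linear(obs, self.shared), 0.0)
        # Each branch reweights the shared features with its own gate.
        pol_feats = feats * sigmoid(linear(feats, self.policy_gate))
        val_feats = feats * sigmoid(linear(feats, self.value_gate))
        # Branch-specific hidden layers (ReLU).
        pol_h = np.maximum(linear(pol_feats, self.policy_hidden), 0.0)
        val_h = np.maximum(linear(val_feats, self.value_hidden), 0.0)
        # Policy head: numerically stable softmax over action logits.
        logits = linear(pol_h, self.policy_out)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=-1, keepdims=True)
        # Value head: one scalar estimate per observation.
        value = linear(val_h, self.value_out)[:, 0]
        return probs, value


# Usage with hypothetical sizes: a batch of 4 flat observations.
model = PartiallyDecoupledActorCritic(obs_dim=32, feat_dim=16, hidden_dim=8, n_actions=5)
obs = rng.normal(size=(4, 32))
probs, value = model.forward(obs)
```

In this sketch only the trunk parameters receive gradients from both objectives during training, while the gates and later layers are specific to one branch, which is the intermediate point between a fully shared and a fully separate design that the abstract describes.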