Keywords: Generalization, Reinforcement Learning, Policy-Value Separation, Policy Optimization, Procgen
TL;DR: This work analyzes generalization performance as a function of the extent of decoupling between the policy and value networks.
Abstract: The policy-value representation asymmetry negatively affects the generalization capability of traditional actor-critic architectures that use a shared representation for the policy and the value function. Fully decoupled (separate) policy and value networks avoid overfitting by addressing this representation asymmetry. However, using two separate networks increases computational overhead. Recent work has also shown that partial separation can achieve the same level of generalization in most tasks while reducing this overhead. Thus, several questions arise: Do we really need two separate networks? Are there scenarios in which only full separation works? Does increasing the degree of separation in a partially separated network improve generalization? In this work, we analyze generalization performance vis-à-vis the extent of decoupling of the policy and value networks. We compare four degrees of network separation, namely fully shared, early separation, late separation, and full separation, on the RL generalization benchmark Procgen, a suite of 16 procedurally generated environments. We show that unless the environment has a distinct or explicit source of value estimation, partial late separation can easily capture the necessary policy-value representation asymmetry and achieve better generalization performance in unseen scenarios; early separation, however, fails to produce good results.
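The sketch below is a minimal illustration, not the authors' implementation, of how the four degrees of policy-value separation could be expressed as a single PyTorch actor-critic module whose split point is a constructor argument. The CNN trunk, layer sizes, the 3x64x64 Procgen observation shape, the 15-action output, and the exact split points assigned to "early" and "late" separation are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def make_blocks():
    # Fresh (independently initialized) trunk blocks; sizes are illustrative only.
    return [
        nn.Sequential(nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU()),
        nn.Sequential(nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU()),
        nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU()),
        nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU()),
    ]


class ActorCritic(nn.Module):
    """Actor-critic with a configurable policy/value split point.

    split_after = 4 -> fully shared trunk (only the final heads differ)
    split_after = 3 -> late separation   (assumed split point)
    split_after = 1 -> early separation  (assumed split point)
    split_after = 0 -> full separation   (two independent trunks)
    """

    def __init__(self, num_actions: int, split_after: int):
        super().__init__()
        # An empty nn.Sequential acts as an identity mapping.
        self.shared = nn.Sequential(*make_blocks()[:split_after])
        self.policy_branch = nn.Sequential(*make_blocks()[split_after:])
        self.value_branch = nn.Sequential(*make_blocks()[split_after:])
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, obs):
        h = self.shared(obs)
        return self.policy_head(self.policy_branch(h)), self.value_head(self.value_branch(h))


if __name__ == "__main__":
    obs = torch.zeros(4, 3, 64, 64)  # dummy batch of Procgen-sized observations
    for split_after, name in [(4, "fully shared"), (3, "late separation"),
                              (1, "early separation"), (0, "full separation")]:
        model = ActorCritic(num_actions=15, split_after=split_after)
        logits, value = model(obs)
        print(name, logits.shape, value.shape)
```

Under these assumptions, the only difference between the four variants is how many trunk blocks are shared before the policy and value branches diverge, which is the quantity the abstract varies when comparing generalization performance.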