DNA: Proximal Policy Optimization with a Dual Network Architecture

Published: 31 Oct 2022, Last Modified: 12 Mar 2024, NeurIPS 2022 Accept, Readers: Everyone
Keywords: Reinforcement Learning, Policy Gradient, Deep Reinforcement Learning, Noise Scale, Atari, Procgen, Mujoco
TL;DR: Due to large differences in noise levels, Proximal Policy Optimization's performance can be greatly improved by learning value and policy independently.
Abstract: This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate, whereas value learning noise levels decrease when using a lower-bias estimate. Together, these insights inform an extension to Proximal Policy Optimization we call Dual Network Architecture (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2206.10027/code)
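
The abstract's point about return estimates can be illustrated with a small sketch. It is not the authors' implementation. It assumes the return estimates are λ-returns computed via Generalized Advantage Estimation, where a smaller λ lowers variance at the cost of more bias and a larger λ does the reverse. The function names and the constants `LAMBDA_POLICY` and `LAMBDA_VALUE` are hypothetical and chosen only for illustration; they are not taken from this page.

```python
# Minimal sketch (assumption: lambda-returns via GAE; not the paper's code).

def gae_advantages(rewards, values, gamma, lam):
    """GAE advantages for one episode that terminates after the last step.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_{T-1}); the value after termination is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
    return advantages


def lambda_returns(rewards, values, gamma, lam):
    """Lambda-return targets: GAE advantage plus the baseline value."""
    adv = gae_advantages(rewards, values, gamma, lam)
    return [a + v for a, v in zip(adv, values)]


# Hypothetical, illustrative settings (not values stated in the source).
LAMBDA_POLICY = 0.8   # smaller lambda: lower-variance estimate for the policy update
LAMBDA_VALUE = 0.95   # larger lambda: lower-bias target for the value update

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
policy_advantages = gae_advantages(rewards, values, gamma=0.99, lam=LAMBDA_POLICY)
value_targets = lambda_returns(rewards, values, gamma=0.99, lam=LAMBDA_VALUE)
```

With separate networks, each update can use its own λ. Setting `lam=1` recovers Monte Carlo returns (least bias, most variance), and `lam=0` recovers one-step TD targets (most bias, least variance).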