Overcoming Policy Collapse in Deep Reinforcement Learning

Published: 20 Jul 2023, Last Modified: 29 Aug 2023, EWRL16
Keywords: Reinforcement learning, Scalability, Loss of Plasticity, Forgetting
TL;DR: Investigating and overcoming the problem of policy collapse in deep reinforcement learning
Abstract: A long-sought characteristic of reinforcement learning agents is scalable performance, that is, the ability to continue learning and improving from a never-ending stream of experience. However, current deep reinforcement learning algorithms are known to be brittle and difficult to train, which limits their scalability. For example, the learned policy can dramatically worsen after some initial training as the agent continues to interact with the environment. We call this phenomenon \textit{policy collapse}. We first establish that policy collapse can occur in both policy gradient and value-based methods. Policy collapse happens in these algorithms on standard benchmarks such as MuJoCo environments when they are trained with their commonly used hyper-parameters. In a simple 2-state MDP, we show that the standard use of the Adam optimizer with its default hyper-parameters is a root cause of policy collapse. Specifically, the standard use of Adam can lead to sudden large weight changes even when the gradient is small, whenever there is non-stationarity in the data stream. We find that policy collapse can be successfully mitigated by using the same hyper-parameters for the running averages of the first and second moments of the gradient. Additionally, we find that aggressive L2 regularization also mitigates policy collapse in many cases. Our work establishes that a minimal change to the existing usage of deep reinforcement learning can mitigate policy collapse and enable more stable and scalable deep reinforcement learning.