On Proximal Policy Optimization's Heavy-Tailed Gradients

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Heavy-tailed Gradients, Proximal Policy Optimization, Robust Estimation, Deep Reinforcement Learning
Abstract: Modern policy gradient algorithms, notably Proximal Policy Optimization (PPO), rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate pronounced heavy-tailedness of the gradients, specifically for the actor network, which increases as the current policy diverges from the behavioral one (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. Subsequently, we study the effects of the standard PPO clipping heuristics, demonstrating how these tricks primarily serve to offset heavy-tailedness in gradients. Motivated by these connections, we propose incorporating GMOM (a high-dimensional robust estimator) into PPO as a substitute for three clipping tricks, achieving performance close to PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.
One-sentence Summary: We study the heavy-tailed behavior of gradients in PPO and propose incorporating GMOM (a robust estimator from statistics) as a substitute for clipping heuristics.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=ndw2cNplG_
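
To make the abstract's proposal concrete, below is a minimal sketch of how a geometric median-of-means (GMOM) style estimator can aggregate per-sample gradients in place of a plain minibatch mean. This is an illustration, not the paper's implementation: the function names (geometric_median, gmom_gradient), the per_sample_grads array, the block count, and the Weiszfeld iteration details are all assumptions made for the example.

```python
# Illustrative geometric median-of-means (GMOM) gradient aggregation.
# Assumes per-sample gradients are flattened into a (n_samples, dim) array;
# the block count and Weiszfeld settings are hypothetical defaults.
import numpy as np

def geometric_median(points, n_iter=100, tol=1e-6):
    """Approximate the geometric median of row vectors via Weiszfeld's algorithm."""
    median = points.mean(axis=0)
    for _ in range(n_iter):
        dists = np.linalg.norm(points - median, axis=1)
        dists = np.clip(dists, tol, None)  # guard against division by zero
        weights = 1.0 / dists
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < tol:
            break
        median = new_median
    return median

def gmom_gradient(per_sample_grads, n_blocks=10):
    """Mean gradients within blocks, then take the geometric median across block means."""
    blocks = np.array_split(per_sample_grads, n_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(block_means)
```

In use, the GMOM estimate would replace the minibatch-mean gradient before the optimizer step (e.g., grad = gmom_gradient(per_sample_grads)), which is how a robust aggregator can stand in for loss and gradient clipping when the gradient distribution is heavy-tailed.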