PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

Perttu Hämäläinen, Amin Babadi, Xiaoxiao Ma, Jaakko Lehtinen

Sep 27, 2018 ICLR 2019 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach. However, in continuous state and actions spaces and a Gaussian policy -- common in computer animation and robotics -- PPO is prone to getting stuck in local optima. In this paper, we observe a tendency of PPO to prematurely shrink the exploration variance, which naturally leads to slow progress. Motivated by this, we borrow ideas from CMA-ES, a black-box optimization method designed for intelligent adaptive Gaussian exploration, to derive PPO-CMA, a novel proximal policy optimization approach that expands the exploration variance on objective function slopes and only shrinks the variance when close to the optimum. This is implemented by using separate neural networks for policy mean and variance and training the mean and variance in separate passes. Our experiments demonstrate a clear improvement over vanilla PPO in many difficult OpenAI Gym MuJoCo tasks.
  • Keywords: Continuous Control, Reinforcement Learning, Policy Optimization, Policy Gradient, Evolution Strategies, CMA-ES, PPO
  • TL;DR: We propose a new continuous control reinforcement learning method with a variance adaptation strategy inspired by the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) optimization method
0 Replies