Policy Optimization with Stochastic Mirror Descent

Anonymous

Sep 25, 2019 · ICLR 2020 Conference Blind Submission
  • Keywords: reinforcement learning, policy gradient, stochastic variance reduced gradient, sample efficiency, stochastic mirror descent
  • TL;DR: We propose a sample-efficient policy gradient method based on stochastic mirror descent, built on a variance-reduced policy gradient estimator.
  • Abstract: Improving sample efficiency has been a longstanding goal in reinforcement learning. In this paper, we propose $\mathtt{VRMPO}$, a sample-efficient policy gradient method with stochastic mirror descent. A novel variance-reduced policy gradient estimator is the key to $\mathtt{VRMPO}$'s improved sample efficiency. Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity. We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in various settings.
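
Since the page carries only the abstract, the sketch below is a rough, assumption-laden illustration of the two ingredients the abstract names, not the paper's $\mathtt{VRMPO}$ algorithm: an SVRG-style variance-reduced score-function (REINFORCE) gradient estimator plugged into a stochastic mirror-descent-style update. The softmax bandit environment, batch sizes, step size, Euclidean mirror map, and all function names are hypothetical choices made for the example.

```python
import numpy as np

# -- Toy problem: a K-armed bandit with a softmax policy (illustrative only) --
rng = np.random.default_rng(0)
K = 5
true_means = rng.uniform(0.0, 1.0, size=K)   # unknown expected reward per arm

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample(theta, n):
    """Draw n actions from pi_theta and observe noisy rewards."""
    probs = softmax(theta)
    actions = rng.choice(K, size=n, p=probs)
    rewards = true_means[actions] + 0.1 * rng.standard_normal(n)
    return actions, rewards

def score_grad(theta_eval, actions, rewards, theta_sample=None):
    """REINFORCE gradient estimate of J at theta_eval from (action, reward) samples.

    If the samples were drawn from a different policy theta_sample, an
    importance weight corrects for the distribution mismatch."""
    probs_eval = softmax(theta_eval)
    if theta_sample is None:
        weights = np.ones(len(actions))
    else:
        probs_sample = softmax(theta_sample)
        weights = probs_eval[actions] / probs_sample[actions]
    grad = np.zeros(K)
    for a, r, w in zip(actions, rewards, weights):
        g_log = -probs_eval.copy()
        g_log[a] += 1.0                       # gradient of log softmax at action a
        grad += w * r * g_log
    return grad / len(actions)

def mirror_step(theta, grad, step_size):
    """One stochastic mirror ascent step (maximizing reward).

    With psi(theta) = 0.5 * ||theta||^2 the dual maps are identities and the
    update reduces to a gradient step; a different Bregman divergence would
    only change these two maps."""
    grad_psi = lambda x: x        # nabla psi
    grad_psi_star = lambda x: x   # nabla psi* (convex conjugate)
    return grad_psi_star(grad_psi(theta) + step_size * grad)

# -- SVRG-style variance-reduced policy optimization loop --
theta = np.zeros(K)
n_epochs, inner_steps = 20, 10
big_batch, mini_batch, step_size = 500, 20, 0.5

for epoch in range(n_epochs):
    # Snapshot: large-batch gradient estimate at the reference policy.
    theta_ref = theta.copy()
    ref_actions, ref_rewards = sample(theta_ref, big_batch)
    full_grad_ref = score_grad(theta_ref, ref_actions, ref_rewards)

    for _ in range(inner_steps):
        actions, rewards = sample(theta, mini_batch)
        g_cur = score_grad(theta, actions, rewards)
        # Importance-weighted gradient at the snapshot, on the same samples.
        g_ref = score_grad(theta_ref, actions, rewards, theta_sample=theta)
        v = g_cur - g_ref + full_grad_ref     # variance-reduced estimator
        theta = mirror_step(theta, v, step_size)

print("learned policy:", np.round(softmax(theta), 3))
print("best arm:", int(np.argmax(true_means)))
```

Swapping the Euclidean mirror map for another Bregman divergence only changes `mirror_step`; the per-epoch snapshot gradient plus the importance-weighted correction term is what provides the variance reduction in this sketch.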