Keywords: reinforcement learning, policy gradient, stochastic variance reduced gradient, sample efficiency, stochastic mirror descent
TL;DR: We propose a sample-efficient policy gradient method based on stochastic mirror descent, built on a variance-reduced policy gradient estimator.
Abstract: Improving sample efficiency has been a longstanding goal in reinforcement learning.
In this paper, we propose $\mathtt{VRMPO}$, a sample-efficient policy gradient method based on stochastic mirror descent.
The key to $\mathtt{VRMPO}$'s improved sample efficiency is a novel variance-reduced policy gradient estimator.
Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point,
which matches the best-known sample complexity.
We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in various settings.
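
To make the two ingredients named in the abstract concrete, here is a minimal, illustrative sketch of an SVRG-style variance-reduced policy gradient estimator combined with a mirror descent parameter update, in a toy bandit setting. It is not the authors' $\mathtt{VRMPO}$ implementation; the bandit environment, batch sizes, and helper names (`sample_batch`, `pg_estimate`, `mirror_step`) are assumptions made purely for illustration.

```python
# Sketch only: SVRG-style variance-reduced policy gradient + mirror descent
# on a toy softmax bandit.  Not the authors' VRMPO algorithm.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4
TRUE_REWARD = rng.normal(size=N_ACTIONS)   # toy bandit mean rewards


def policy(theta):
    """Softmax policy over actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def sample_batch(theta, n):
    """Sample actions/rewards from the policy; keep behaviour probabilities."""
    p = policy(theta)
    actions = rng.choice(N_ACTIONS, size=n, p=p)
    rewards = TRUE_REWARD[actions] + rng.normal(scale=0.1, size=n)
    return actions, rewards, p[actions]


def pg_estimate(theta, actions, rewards, behavior_probs):
    """Importance-weighted score-function gradient estimate at `theta`."""
    p = policy(theta)
    grad = np.zeros(N_ACTIONS)
    for a, r, b in zip(actions, rewards, behavior_probs):
        w = p[a] / b                       # importance weight: behaviour -> theta
        score = -p.copy()
        score[a] += 1.0                    # d/dtheta log softmax(theta)[a]
        grad += w * r * score
    return grad / len(actions)


def mirror_step(theta, grad, lr=0.5):
    """Mirror descent step with the Euclidean mirror map (plain gradient
    ascent); other mirror maps (e.g. negative entropy) yield different updates."""
    return theta + lr * grad


theta = np.zeros(N_ACTIONS)
for epoch in range(30):
    theta_ref = theta.copy()
    a, r, b = sample_batch(theta_ref, 128)
    mu = pg_estimate(theta_ref, a, r, b)       # large-batch snapshot gradient
    for _ in range(5):
        a, r, b = sample_batch(theta, 8)       # small batch from current policy
        # SVRG-style correction: the same samples are evaluated at theta and
        # at the reference point theta_ref (via importance weighting).
        g = pg_estimate(theta, a, r, b) - pg_estimate(theta_ref, a, r, b) + mu
        theta = mirror_step(theta, g)

print("learned policy :", np.round(policy(theta), 3))
print("optimal action :", int(np.argmax(TRUE_REWARD)))
```

The paper's actual estimator and mirror map may differ (e.g., a recursive SARAH-style update rather than a fixed snapshot); this sketch only illustrates the general variance-reduction and mirror-descent structure the abstract refers to.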
Community Implementations: [3 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:1906.10462/code)