Policy Optimization with Stochastic Mirror Descent

25 Sept 2019 (modified: 22 Oct 2023) | ICLR 2020 Conference Blind Submission | Readers: Everyone
Keywords: reinforcement learning, policy gradient, stochastic variance-reduced gradient, sample efficiency, stochastic mirror descent
TL;DR: We propose a sample-efficient policy gradient method based on stochastic mirror descent, built on a variance-reduced policy gradient estimator.
Abstract: Improving sample efficiency has been a longstanding goal in reinforcement learning. In this paper, we propose $\mathtt{VRMPO}$, a sample-efficient policy gradient method with stochastic mirror descent. A novel variance-reduced policy gradient estimator is the key to the improved sample efficiency of $\mathtt{VRMPO}$. Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity. We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in various settings.
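Since the abstract only names the two ingredients (a variance-reduced policy gradient estimator and a stochastic mirror descent update), the sketch below illustrates how they can be combined on a toy one-state softmax policy. It uses a REINFORCE gradient, an SVRG-style snapshot correction, and a mirror descent step with a Euclidean mirror map; the toy environment, hyperparameters, and the omission of importance weights are all illustrative assumptions, not the paper's $\mathtt{VRMPO}$ algorithm.

```python
# Sketch only: variance-reduced policy gradient + stochastic mirror descent on a toy problem.
import numpy as np

rng = np.random.default_rng(0)
n_actions, horizon = 4, 10
theta = np.zeros(n_actions)            # logits of a one-state softmax policy (toy assumption)

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_trajectory(theta):
    """Draw actions from the policy; reward 1 for action 0 (toy environment, assumption)."""
    pi = policy(theta)
    actions = rng.choice(n_actions, size=horizon, p=pi)
    rewards = (actions == 0).astype(float)
    return actions, rewards

def grad_log_pi(theta, a):
    """Gradient of log softmax probability of action a with respect to the logits."""
    g = -policy(theta)
    g[a] += 1.0
    return g

def reinforce_grad(theta, actions, rewards):
    """REINFORCE estimator of the policy gradient for one trajectory (reward-to-go weighting)."""
    returns = np.cumsum(rewards[::-1])[::-1]
    return sum(r * grad_log_pi(theta, a) for a, r in zip(actions, returns))

def mirror_step(theta, g, lr):
    """Mirror ascent step; with a Euclidean mirror map this reduces to plain gradient
    ascent, standing in for the Bregman-proximal update used in the paper."""
    return theta + lr * g

# SVRG-style outer/inner loop: a batched snapshot gradient corrected by per-trajectory
# differences, one common way to build a variance-reduced policy gradient estimator.
# Importance weights correcting for sampling under theta rather than the snapshot
# are omitted for brevity (simplifying assumption).
lr, n_outer, n_inner, batch = 0.05, 20, 5, 16
for _ in range(n_outer):
    snapshot = theta.copy()
    snap_grad = np.mean([reinforce_grad(snapshot, *sample_trajectory(snapshot))
                         for _ in range(batch)], axis=0)
    for _ in range(n_inner):
        acts, rews = sample_trajectory(theta)
        g = (reinforce_grad(theta, acts, rews)
             - reinforce_grad(snapshot, acts, rews)
             + snap_grad)
        theta = mirror_step(theta, g, lr)

print("learned action probabilities:", np.round(policy(theta), 3))
```

Running the sketch drives most of the probability mass onto the rewarded action; swapping the Euclidean mirror map for a negative-entropy one would turn the update into an exponentiated-gradient step, closer in spirit to mirror descent over the probability simplex.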
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:1906.10462/code)