Keywords: reinforcement learning, policy gradient, stochastic variance reduced gradient, sample efficiency, stochastic mirror descent
TL;DR: We propose a sample-efficient policy gradient method based on stochastic mirror descent, built on a variance-reduced policy gradient estimator.
Abstract: Improving sample efficiency has been a longstanding goal in reinforcement learning.
In this paper, we propose $\mathtt{VRMPO}$, a sample-efficient policy gradient method based on stochastic mirror descent.
The key to $\mathtt{VRMPO}$'s improved sample efficiency is a novel variance-reduced policy gradient estimator.
Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point,
which matches the best-known sample complexity.
We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in various settings.
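
To make the two ingredients named in the abstract concrete, here is a minimal, illustrative sketch of an SVRG-style variance-reduced policy gradient estimator combined with a mirror descent parameter update, in a toy bandit setting. It is not the authors' $\mathtt{VRMPO}$ implementation; the bandit environment, batch sizes, and helper names (`sample_batch`, `pg_estimate`, `mirror_step`) are assumptions made purely for illustration.

```python
# Sketch only: SVRG-style variance-reduced policy gradient + mirror descent
# on a toy softmax bandit.  Not the authors' VRMPO algorithm.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4
TRUE_REWARD = rng.normal(size=N_ACTIONS)   # toy bandit mean rewards


def policy(theta):
    """Softmax policy over actions."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def sample_batch(theta, n):
    """Sample actions/rewards from the policy; keep behaviour probabilities."""
    p = policy(theta)
    actions = rng.choice(N_ACTIONS, size=n, p=p)
    rewards = TRUE_REWARD[actions] + rng.normal(scale=0.1, size=n)
    return actions, rewards, p[actions]


def pg_estimate(theta, actions, rewards, behavior_probs):
    """Importance-weighted score-function gradient estimate at `theta`."""
    p = policy(theta)
    grad = np.zeros(N_ACTIONS)
    for a, r, b in zip(actions, rewards, behavior_probs):
        w = p[a] / b                       # importance weight: behaviour -> theta
        score = -p.copy()
        score[a] += 1.0                    # d/dtheta log softmax(theta)[a]
        grad += w * r * score
    return grad / len(actions)


def mirror_step(theta, grad, lr=0.5):
    """Mirror descent step with the Euclidean mirror map (plain gradient
    ascent); other mirror maps (e.g. negative entropy) yield different updates."""
    return theta + lr * grad


theta = np.zeros(N_ACTIONS)
for epoch in range(30):
    theta_ref = theta.copy()
    a, r, b = sample_batch(theta_ref, 128)
    mu = pg_estimate(theta_ref, a, r, b)       # large-batch snapshot gradient
    for _ in range(5):
        a, r, b = sample_batch(theta, 8)       # small batch from current policy
        # SVRG-style correction: the same samples are evaluated at theta and
        # at the reference point theta_ref (via importance weighting).
        g = pg_estimate(theta, a, r, b) - pg_estimate(theta_ref, a, r, b) + mu
        theta = mirror_step(theta, g)

print("learned policy :", np.round(policy(theta), 3))
print("optimal action :", int(np.argmax(TRUE_REWARD)))
```

The paper's actual estimator and mirror map may differ (e.g., a recursive SARAH-style update rather than a fixed snapshot); this sketch only illustrates the general variance-reduction and mirror-descent structure the abstract refers to.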
Community Implementations: [3 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:1906.10462/code)