- Abstract: In this paper, we propose the StochAstic Recursive grAdient Policy Optimization (SARAPO) algorithm, a novel variance reduction method for Trust Region Policy Optimization (TRPO). The algorithm incorporates the StochAstic Recursive grAdient algoritHm (SARAH) into the TRPO framework. Compared with the existing Stochastic Variance Reduced Policy Optimization (SVRPO), our algorithm exhibits more stable variance. Furthermore, by theoretically analyzing the ordinary differential equation and the stochastic differential equation (ODE/SDE) of SARAH, we characterize its convergence properties and stability. Our experiments demonstrate its performance on a variety of benchmark tasks. We show that our algorithm achieves greater improvement in each iteration and matches or even outperforms SVRPO and TRPO.
- Keywords: reinforcement learning, policy gradient, variance reduction, stochastic recursive gradient algorithm
- TL;DR: This paper proposes the StochAstic Recursive grAdient Policy Optimization (SARAPO) algorithm, based on the novel SARAH method, and demonstrates its advantages over existing policy gradient methods both theoretically and experimentally.
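
For readers unfamiliar with SARAH, the following is a minimal sketch of its recursive gradient estimator on a toy one-dimensional least-squares problem. It is not the SARAPO algorithm and involves no trust region or policy; the objective, the data, the step size, and the names `grad_i`, `full_grad`, and `sarah` are all assumptions chosen purely for illustration.

```python
import random

def grad_i(w, a, b):
    # Gradient of the hypothetical per-sample loss f_i(w) = 0.5 * (a*w - b)**2.
    return a * (a * w - b)

def full_grad(w, data):
    # Gradient of the average loss over the whole (toy) dataset.
    return sum(grad_i(w, a, b) for a, b in data) / len(data)

def sarah(data, w0=0.0, lr=0.1, outer=30, inner=10, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(outer):
        # Each outer loop starts from a full-gradient step.
        v = full_grad(w, data)
        w_prev, w = w, w - lr * v
        for _ in range(inner):
            a, b = rng.choice(data)
            # Recursive estimate: correct the previous estimate v with the
            # change in one sampled gradient between consecutive iterates.
            v = grad_i(w, a, b) - grad_i(w_prev, a, b) + v
            w_prev, w = w, w - lr * v
    return w

# Toy data with b = 2*a, so every per-sample loss is minimized at w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0), (1.5, 3.0)]
w_star = sarah(data)
```

The inner update reuses the previous estimate `v` rather than a fixed snapshot gradient, which is the feature that distinguishes SARAH from SVRG-style estimators.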