Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms

Published: 28 Aug 2025, Last Modified: 28 Aug 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: We revisit the REINFORCE policy gradient algorithm from the literature, which works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm in which we estimate the policy gradient using a smoothed functional (random perturbation) gradient estimator obtained from direct function measurements. To handle the high variance that is typical of REINFORCE, we propose two independent enhancements to the basic scheme: (i) use the sign of the increment instead of the original (full) increment, which results in smoother convergence, and (ii) use clipped gradient estimates as proposed in the Proximal Policy Optimization (PPO) scheme. We prove the asymptotic convergence of all algorithms and present results of several experiments on various MuJoCo locomotion tasks, comparing the performance of our algorithms with the recently proposed ARS algorithms as well as other well-known algorithms, namely A2C, PPO, and TRPO. Our algorithms are competitive against all of these and in fact show the best results on a majority of the experiments.
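As a rough illustration of the ideas summarized in the abstract, the sketch below shows a generic two-sided smoothed functional (random perturbation) gradient estimate built from direct return measurements, together with simplified stand-ins for the two variance-reduction variants: using the sign of the increment, and an elementwise clip in place of the PPO-style clipping. All names (`episode_return`, `sf_gradient`, `update`) and hyperparameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def sf_gradient(theta, episode_return, delta=0.05, num_perturbations=8):
    """Two-sided smoothed functional gradient estimate from return measurements.

    `episode_return(theta)` is assumed to run one episode with policy
    parameters `theta` and return the total reward (an assumption, not the
    paper's code).
    """
    grad = np.zeros_like(theta)
    for _ in range(num_perturbations):
        d = np.random.randn(*theta.shape)            # Gaussian perturbation direction
        j_plus = episode_return(theta + delta * d)   # return with perturbed-up parameters
        j_minus = episode_return(theta - delta * d)  # return with perturbed-down parameters
        grad += d * (j_plus - j_minus) / (2.0 * delta)
    return grad / num_perturbations

def update(theta, grad, lr=1e-2, mode="sign", clip=1.0):
    """Variance-reduced ascent step on the return.

    mode="sign": enhancement (i), use only the sign of the increment.
    mode="clip": a simple elementwise clip, standing in for the paper's
    PPO-style clipped estimates.
    """
    if mode == "sign":
        return theta + lr * np.sign(grad)
    return theta + lr * np.clip(grad, -clip, clip)
```

This is only a minimal sketch of the general smoothed functional estimator; the paper's actual algorithms, their convergence conditions, and the exact form of the clipping differ in the details.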
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have added our names and addresses to the final version, and added an acknowledgement after the conclusions section.
Video: https://www.youtube.com/watch?v=89qAu3DwNDs
Supplementary Material: zip
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 4685