Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms

TMLR Paper2436 Authors

28 Mar 2024 (modified: 09 Oct 2024) · Rejected by TMLR · CC BY 4.0
Abstract: We revisit the REINFORCE policy gradient algorithm from the literature. This algorithm typically works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm in which we estimate the policy gradient using a smoothed functional (random perturbation) gradient estimator that requires only one function measurement at a perturbed parameter. Subsequently, we also propose a two-simulation counterpart of the algorithm that has lower estimator bias. Like REINFORCE, our algorithms are trajectory-based Monte-Carlo schemes and usually suffer from high variance. To handle this issue, we propose two independent enhancements to the basic scheme: (i) using the sign of the increment instead of the original (full) increment, which results in smoother albeit possibly slower convergence, and (ii) using clipped costs or rewards as proposed in the Proximal Policy Optimization (PPO) scheme. We analyze the asymptotic convergence of the algorithm in the one-simulation case as well as in the case where signed updates are used, and briefly discuss the changes in the analysis when two-simulation estimators are used. Finally, we present the results of several experiments on various Grid-World settings in which we compare the performance of the proposed algorithms with REINFORCE as well as PPO, and observe that both our one-simulation and two-simulation SF algorithms outperform these baselines. Further, the clipped and signed-update versions of these algorithms show good performance with lower variance.
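For context, the generic smoothed functional (SF) gradient estimators from the stochastic approximation literature take the forms sketched below, where $J(\theta)$ denotes the expected return, $\beta > 0$ is a smoothing parameter, and $\Delta$ is a random perturbation vector (e.g., standard Gaussian). This is only an illustrative sketch; the paper's exact estimators, perturbation distribution, and step-size choices are not specified in this abstract.

```latex
% Hedged sketch of generic one- and two-measurement SF gradient estimators
% and a signed-update rule (for reward maximization); the paper's precise
% forms may differ from these standard expressions.
\begin{align}
  \widehat{\nabla}_{\theta} J(\theta)
    &\approx \frac{\Delta}{\beta}\, J(\theta + \beta \Delta)
    && \text{(one function measurement)} \\
  \widehat{\nabla}_{\theta} J(\theta)
    &\approx \frac{\Delta}{2\beta}\,
      \bigl( J(\theta + \beta \Delta) - J(\theta - \beta \Delta) \bigr)
    && \text{(two measurements, lower bias)} \\
  \theta_{k+1}
    &= \theta_k + a_k\, \operatorname{sign}\!\bigl( \widehat{\nabla}_{\theta} J(\theta_k) \bigr)
    && \text{(signed-update variant, step size } a_k\text{)}
\end{align}
```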
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=NHqQbJ08Bp
Changes Since Last Submission: 1. We have fixed the style format (removed the double line spacing). 2. The reference "S. Bhatnagar. The reinforce policy gradient algorithm revisited. In 2023 Ninth Indian Control Conference (ICC), pp. 177–177. IEEE, 2023." has now been added and is compared with in the first paragraph on page 2. There we mention the following: "A similar scheme as our first (single-measurement) algorithm is briefly presented in Bhatnagar (2023) that however does not present any analysis of convergence or experiments. Our paper, on the other hand, not only provides a detailed analysis and experiments with the one-measurement scheme, but also analyzes several other related algorithms both for their convergence as well as empirical performance." Please note that the above reference is an extended abstract.
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 2436