Abstract: We revisit the REINFORCE policy gradient algorithm from the literature that works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm where we estimate the policy gradient using a smoothed functional (random perturbation) gradient estimator obtained from direct function measurements. To handle the issue of high variance that is typical of REINFORCE, we propose two independent enhancements to the basic scheme: (i) use the sign of the increment instead of the original (full) increment, which results in smoother convergence, and (ii) use clipped gradient estimates as proposed in the Proximal Policy Optimization (PPO) based scheme. We prove the asymptotic convergence of all algorithms and present the results of several experiments on various MuJoCo locomotion tasks, wherein we compare the performance of our algorithms with the recently proposed and well-studied ARS algorithms from the literature. Our algorithms are seen to be competitive with ARS.
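As a rough illustration of the estimator described in the abstract, the sketch below builds a two-point smoothed functional (random perturbation) gradient estimate of the return from two episodic measurements along a random Gaussian direction. This is a minimal sketch, not the paper's implementation: the helper `episodic_return`, the perturbation scale `delta`, and the Gaussian choice of perturbation are assumptions made purely for illustration.

```python
import numpy as np

def sf_policy_gradient(theta, episodic_return, delta=0.05, rng=None):
    """Two-point smoothed functional (random perturbation) gradient estimate.

    `episodic_return(theta)` is assumed to run one episode with policy
    parameters `theta` and return its cumulative reward; `delta` is the
    perturbation scale. Both are illustrative placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)           # random perturbation direction
    j_plus = episodic_return(theta + delta * u)    # one rollout per perturbation sign
    j_minus = episodic_return(theta - delta * u)
    # Difference quotient along u serves as the policy-gradient estimate.
    return ((j_plus - j_minus) / (2.0 * delta)) * u
```

A parameter update would then ascend along this estimate, optionally after the signed or clipped transformations that the abstract mentions as variance-reduction devices.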
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=loaWwnhYaS
Changes Since Last Submission: We have incorporated all of the reviewers' comments from the previous round and addressed their concerns through substantial revisions, as follows:
1. We now provide a detailed discussion of related work, including evolutionary strategies and augmented random search based approaches.
2. We now show the results of detailed experiments on MuJoCo locomotion tasks for the Swimmer, Hopper, Walker2d and HalfCheetah environments. We observe that, for the same number of environment interactions, SFR-2 with clipping and signed updates performs consistently better on two of the four tasks when compared with the ARS algorithms.
3. We have now given proofs of the bias reduction claim (Lemma 1) and the variance reduction claim (Lemma 2 and Remark 1). The detailed proofs of these results are provided in Appendix A.3 and Appendix A.4, respectively.
4. Furthermore, we also prove the asymptotic convergence of the ES and ARS algorithms mentioned by the reviewers. Prior work had provided only finite-time (non-asymptotic) analyses of these algorithms, and only under much stronger requirements. We prove asymptotic convergence under just two assumptions, namely Assumptions 1 and 2. In fact, we prove all of the basic requirements, such as the parameterized value function being differentiable with a Lipschitz continuous gradient (Lemma 4). This is unlike the papers on ES/ARS, which make much stronger assumptions but do not prove whether these assumptions are valid in the settings that they consider.
5. We now show in Table 2 the performance comparisons of our algorithm with ARS-v1t and ARS-v2t, both in their original versions and with gradient clipping (component-wise and norm-clip) and signed updates, i.e., these variants have also been tried on the ARS algorithms. SFR performs parameter updates more often than ARS, albeit these updates are bound to have higher variance. This motivates us to use clipped and signed gradients, as they reduce variance and improve performance (see the illustrative sketch of these update variants after this list). We observe that our algorithm is better across all variants than both ARS-v1t and ARS-v2t on the Swimmer and HalfCheetah environments, though the original SFR-2 algorithm (i.e., without clipping and signed updates) does not show results as good as the original ARS-v1t and ARS-v2t; see Table 2. On Walker2d, SFR-2 with Component Clip is better than ARS-v2t but is not as good as ARS-v1t on this task. This goes to show that SFR-2, with just two environment interactions per parameter update (as against 2k, with k > 1, for ARS), is competitive against both ARS-v1t and ARS-v2t.
6. We also observe from Table 3 of the revised version that SFR-2 achieves the highest reward on HalfCheetah and ranks second on both Swimmer and Hopper, closely trailing ARS. It also surpasses ARS on Walker2d. Overall, our results suggest that
SFR-2, with clipping and signed update mechanisms, is competitive when compared with ARS across a variety of continuous control tasks.
7. We have now removed all discussion pertaining to the interchange of the gradient and expectation operators, as we agree that such discussion is redundant when the state-action spaces are finite.
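For the clipped and signed variants referred to in items 2 and 5, the following hedged sketch shows how such transformations could be applied to a gradient estimate before the ascent step. The function names, the threshold `c`, and the step size `lr` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def signed_update(theta, grad, lr):
    """Signed update: step along the sign of each gradient component."""
    return theta + lr * np.sign(grad)

def component_clip_update(theta, grad, lr, c=1.0):
    """Component-wise clipping: each gradient component is limited to [-c, c]."""
    return theta + lr * np.clip(grad, -c, c)

def norm_clip_update(theta, grad, lr, c=1.0):
    """Norm clipping: rescale the gradient if its Euclidean norm exceeds c."""
    norm = np.linalg.norm(grad)
    if norm > c:
        grad = grad * (c / norm)
    return theta + lr * grad
```

Each transformation only bounds the step taken from a given gradient estimate; the estimate itself still costs two environment interactions per parameter update in SFR-2.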
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 4685