Actor-only and Safe-Actor-only REINFORCE Algorithms with Deterministic Update Times

TMLR Paper4446 Authors

11 Mar 2025 (modified: 26 Jun 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Standard Monte-Carlo policy gradient reinforcement learning (RL) algorithms require aggregation of data over the regeneration epochs constituting an episode (i.e., until a termination state is reached). In real-world applications involving large state and action spaces, visits to goal states can be sparse or infrequent, resulting in long episodes of unpredictable length. As an alternative, we present an RL algorithm, called the Actor-only algorithm (AOA), that performs data aggregation over a fixed (deterministic) number of epochs. This removes the unpredictability in the data aggregation step and thereby in the update instants. Since satisfying safety constraints is crucial in safety-critical applications, we also extend AOA to the safe RL setting, yielding the Safe-Actor-only algorithm (SAOA). We provide asymptotic and finite-time convergence guarantees for our proposed algorithms. The finite-time analysis shows that finding a first-order stationary point, i.e., a point satisfying $\left\|\nabla \bar J\left(\theta\right)\right\|_2^2\leq \epsilon$ or $\left\|\nabla \bar {\mathcal{L}}\left(\theta,\eta\right)\right\|_2^2\leq \epsilon$ of the performance functions $\bar J(\theta)$ and $\bar{\mathcal{L}}(\theta,\eta)$, respectively, requires $\mathcal{O}(\epsilon^{-2})$ sample complexity. Further, our empirical results on benchmark RL environments demonstrate the advantages of the proposed algorithms over comparable algorithms from the literature.
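To make the core idea of deterministic update times concrete, below is a minimal sketch (not the authors' exact AOA) of a REINFORCE-style update that aggregates exactly N sampled transitions per gradient step instead of waiting for episode termination. All names (policy, optimizer, env, N, gamma) and the classic Gym-style environment interface are illustrative assumptions.

```python
import torch

def actor_only_update(policy, optimizer, env, state, N=100, gamma=0.99):
    """One policy-gradient update using exactly N sampled steps (fixed update instant)."""
    log_probs, rewards = [], []
    for _ in range(N):                                   # deterministic aggregation horizon
        obs = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        next_state, reward, done, _ = env.step(action.item())   # classic Gym API assumed
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
        state = env.reset() if done else next_state      # restart if a terminal state is hit

    # Discounted reward-to-go computed within the fixed N-step window (a surrogate return).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)

    # REINFORCE-style surrogate loss averaged over the window.
    loss = -(torch.stack(log_probs) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return state  # carry the environment state into the next update window
```

The point of the sketch is only that the update instant is fixed by N, not by the (random) episode length; the safe variant described in the abstract would replace the surrogate loss with a Lagrangian of the form $\bar{\mathcal{L}}(\theta,\eta)$ combining reward and constraint costs.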
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the Editor and the Reviewers for their comments, which have helped improve the overall quality of this manuscript. We have revised the paper taking into account all of the reviewers' comments and suggestions; the changes in this revised version are highlighted in blue text. Broad changes:
1. We have improved the exposition of our contributions by clearly stating the novelty of the work.
2. We have enhanced the clarity, quality, and presentation of the paper by incorporating the reviewers' specific comments and suggestions; Sections 1-2 and 4-8 have been revised.
3. We have introduced Remark 1 in Section 6.1 to address the significance of, and the assumptions required for, our key lemmas, and Remark 2 in the same section to address further reviewer comments.
4. We have compared our finite-time complexity with five more works in the literature by adding 4 new rows to Table 1 and providing a comparative analysis in Remarks 3 and 4, where we discuss the novelty of our proposed algorithms relative to other algorithms in the literature.
5. We now report experiments on two additional standard RL benchmark environments, 'CartPole' and 'Acrobot', in addition to the 'Grid World' environments shown in the earlier version. A new figure (Figure 5) and a new table (Table 4) demonstrate the performance of our approach in these environments; the results confirm that our algorithms generalize across environments.
6. We have enriched the literature survey with additional relevant references.
Assigned Action Editor: ~Ahmet_Alacaoglu2
Submission Number: 4446