Abstract: Standard Monte-Carlo policy gradient reinforcement learning (RL) algorithms require aggregating data over the regeneration epochs that constitute an episode (i.e., until a termination state is reached). In real-world applications with large state and action spaces, goal states may be hit only rarely, resulting in long episodes of unpredictable length. As an alternative, we present an RL algorithm, the Actor-only algorithm (AOA), that aggregates data over a fixed (deterministic) number of epochs. This removes the unpredictability in the data aggregation step and hence in the update instants. Since satisfying safety constraints is crucial in safety-critical applications, we also extend AOA to the safe RL setting, yielding the Safe Actor-only algorithm (SAOA). In this work, we provide asymptotic and finite-time convergence guarantees for our proposed algorithms toward obtaining the optimal policy. The finite-time analysis shows that both algorithms find a first-order stationary point, i.e., a point satisfying $\left\|\nabla \bar J\left(\theta\right)\right\|_2^2\leq \epsilon$ or $\left\|\nabla \bar {\mathcal{L}}\left(\theta,\eta\right)\right\|_2^2\leq \epsilon$ of the performance functions $\bar J(\theta)$ and $\bar{\mathcal{L}}(\theta,\eta)$, respectively, with $\mathcal{O}(\epsilon^{-2})$ sample complexity. Further, our empirical results on benchmark RL environments demonstrate the advantages of the proposed algorithms over existing algorithms from the literature.
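A minimal sketch, not the authors' implementation: it illustrates the core idea of the actor-only scheme described in the abstract, namely aggregating policy-gradient data over a fixed number of epochs $N$ (so update instants are deterministic) rather than waiting for an episode to terminate. The `ChainEnv` environment, the tabular softmax policy, and all hyperparameters (`N`, `alpha`, `gamma`) are illustrative assumptions; the safe variant (SAOA) would additionally aggregate a gradient of the Lagrangian $\bar{\mathcal{L}}(\theta,\eta)$ and update the multiplier $\eta$, which is omitted here.

```python
import numpy as np


class ChainEnv:
    """Toy chain MDP (assumed for illustration): move left/right along a line;
    reaching the rightmost state gives reward 1 and ends the episode."""

    def __init__(self, n_states=10):
        self.n_states = n_states

    def reset(self):
        return 0

    def step(self, state, action):  # action 0 = left, 1 = right
        next_state = min(state + 1, self.n_states - 1) if action == 1 else max(state - 1, 0)
        done = next_state == self.n_states - 1
        return next_state, (1.0 if done else 0.0), done


def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy at state s."""
    prefs = theta[s] - theta[s].max()  # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()


def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta for the tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g


def fixed_epoch_update(theta, env, state, N=50, alpha=0.05, gamma=0.99, rng=None):
    """One actor-only update that aggregates a GPOMDP-style gradient estimate
    over exactly N transitions, regardless of when (or whether) a goal is hit."""
    rng = rng or np.random.default_rng()
    grad_acc = np.zeros_like(theta)   # aggregated gradient estimate
    score_sum = np.zeros_like(theta)  # running sum of score functions
    discount = 1.0
    for _ in range(N):
        probs = softmax_policy(theta, state)
        action = rng.choice(len(probs), p=probs)
        next_state, reward, done = env.step(state, action)
        score_sum += grad_log_pi(theta, state, action)
        grad_acc += discount * reward * score_sum
        discount *= gamma
        if done:  # start a new trajectory, but keep aggregating until N steps
            state = env.reset()
            score_sum[:] = 0.0
            discount = 1.0
        else:
            state = next_state
    return theta + alpha * grad_acc / N, state


if __name__ == "__main__":
    env = ChainEnv()
    theta, state = np.zeros((env.n_states, 2)), env.reset()
    for _ in range(200):  # 200 update rounds, each of deterministic length N
        theta, state = fixed_epoch_update(theta, env, state)
    print(softmax_policy(theta, 0))  # probability of moving right should grow
```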
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ahmet_Alacaoglu2
Submission Number: 4446