Keywords: Theory for Reinforcement Learning, Discounted Markov Decision Process, Policy Optimization, Policy Gradient, Sample Complexity
TL;DR: We introduce novel unbiased policy gradient algorithms with a random horizon, accompanied by theoretical convergence analysis and empirical experiments.
Abstract: Policy gradient (PG) methods are widely used in reinforcement learning. However, for infinite-horizon discounted reward settings, practical implementations of PG must usually rely on biased gradient estimators due to truncated finite-horizon sampling, which limits practical performance and hinders theoretical analysis. In this work, we introduce a new family of algorithms, __unbiased policy gradient__ (UPG), that enables unbiased gradient estimators by considering finite-horizon undiscounted rewards, with the horizon randomly sampled from the geometric distribution $\mathrm{Geom}(1-\gamma)$ associated with the discount factor $\gamma$. Thanks to the absence of bias, UPG achieves an $\mathcal{O}(\epsilon^{-4})$ sample complexity to a stationary point, an improvement of $\mathcal{O}(\log\epsilon^{-1})$ over that of vanilla PG, obtained under fewer assumptions. Our work also provides a new angle on well-known algorithms such as Q-PGT and RPG. We recover the unbiased Q-PGT algorithm as a special case of UPG, allowing for its first sample complexity analysis. We further show that UPG can be extended to $\alpha$-UPG, a more general class of PG algorithms that also yields unbiased gradient estimators and notably admits RPG as a special case. Our general sample complexity analysis of $\alpha$-UPG recovers the convergence rates of RPG with tighter bounds. Finally, we propose and evaluate two new algorithms within the UPG family: unbiased GPOMDP (UGPOMDP) and $\alpha$-UGPOMDP. We show theoretically and empirically, on four different environments, that both UGPOMDP and $\alpha$-UGPOMDP outperform their vanilla PG counterpart, GPOMDP.
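For intuition, the following is a minimal sketch of the random-horizon idea described in the abstract: sample the horizon from $\mathrm{Geom}(1-\gamma)$, roll out undiscounted rewards, and form a REINFORCE-style estimate. It assumes a Gym-style `env` (`reset`/`step`) and a `policy` exposing `sample` and `grad_log_prob`; these names are illustrative, and the exact form, credit assignment, and scaling of the UPG estimators are given in the paper.

```python
import numpy as np

def sample_horizon(gamma, rng):
    # H ~ Geom(1 - gamma): P(H = h) = (1 - gamma) * gamma**h for h = 0, 1, 2, ...
    # numpy's geometric is supported on {1, 2, ...}, hence the shift by 1.
    return rng.geometric(1.0 - gamma) - 1

def random_horizon_pg_estimate(env, policy, gamma, rng):
    """One REINFORCE-style gradient estimate over a geometrically sampled horizon.

    Rewards are left undiscounted; the randomness of the horizon plays the role
    of the discount factor, which is what removes the truncation bias
    (illustrative sketch, not the paper's exact estimator).
    """
    H = sample_horizon(gamma, rng)
    s = env.reset()
    grad_log_probs, rewards = [], []
    for _ in range(H + 1):  # roll out exactly H + 1 steps (or stop early if terminal)
        a = policy.sample(s)
        grad_log_probs.append(policy.grad_log_prob(s, a))
        s, r, done, _ = env.step(a)
        rewards.append(r)
        if done:
            break
    total_return = sum(rewards)  # undiscounted return over the random horizon
    return total_return * sum(grad_log_probs)

# Usage sketch: average several such estimates and take a gradient ascent step.
# rng = np.random.default_rng(0)
# g = np.mean([random_horizon_pg_estimate(env, policy, 0.99, rng) for _ in range(16)], axis=0)
```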
Submission Number: 28