Keywords: reinforcement learning, policy gradient, stochastic approximation, finite-time MDP
Abstract: Markov Decision Processes (MDPs) provide a formal framework for modeling and solving sequential decision-making problems. In this paper, we make several contributions towards the theoretical understanding of (stochastic) policy gradient methods for MDPs. The focus lies on proving convergence (rates) of softmax policy gradient towards global optima in undiscounted finite-time horizon problems, i.e. $\gamma = 1$, without regularization. Such problems arise, for instance, in optimal stopping or certain supply chain problems. Since $\gamma = 1$, our estimates necessarily differ from those in several recent articles, which involve powers of $(1-\gamma)^{-1}$.
The main contributions are the following. For undiscounted finite-time MDPs, we prove asymptotic convergence of policy gradient to a global optimum and derive a convergence rate using a weak Polyak-\L ojasiewicz (PL) inequality. In each decision epoch, the derived error bound depends linearly on the remaining duration of the MDP. In the second part of the analysis, we quantify the convergence behavior of the stochastic version of policy gradient. The analysis yields complexity bounds for approximating the global optimum to arbitrary accuracy with high probability.
As a by-product, our stochastic gradient arguments prove that the plain vanilla REINFORCE algorithm for softmax policies indeed approximates global optima for sufficiently large batch sizes.
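The setting can be illustrated with a minimal sketch of undiscounted, unregularized REINFORCE for a tabular softmax policy over a finite horizon. The toy MDP below (state/action counts, rewards, transitions, batch size, and learning rate) is entirely hypothetical and not taken from the paper; it only shows the kind of algorithm the analysis covers.

```python
# Minimal sketch: batched REINFORCE with a non-stationary tabular softmax
# policy on a toy finite-horizon MDP (gamma = 1, no regularization).
# All problem data below is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

H, S, A = 3, 2, 2                                # horizon, states, actions (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] = next-state distribution
R = rng.uniform(0.0, 1.0, size=(S, A))           # immediate rewards (hypothetical)

theta = np.zeros((H, S, A))                      # softmax parameters per decision epoch

def policy(theta_h, s):
    """Softmax policy over actions in state s at a given epoch."""
    logits = theta_h[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta):
    """Sample one trajectory, recording (epoch, state, action, reward)."""
    s, traj = 0, []
    for h in range(H):
        a = rng.choice(A, p=policy(theta[h], s))
        traj.append((h, s, a, R[s, a]))
        s = rng.choice(S, p=P[s, a])
    return traj

def reinforce_step(theta, batch_size=256, lr=0.1):
    """One batched plain-vanilla REINFORCE update (gradient ascent)."""
    grad = np.zeros_like(theta)
    for _ in range(batch_size):
        traj = rollout(theta)
        rewards = [r for (_, _, _, r) in traj]
        for h, s, a, _ in traj:
            ret = sum(rewards[h:])               # undiscounted reward-to-go
            glog = -policy(theta[h], s)          # grad of log softmax: e_a - pi(.|s)
            glog[a] += 1.0
            grad[h, s] += glog * ret
    theta += lr * grad / batch_size
    return theta

for _ in range(200):
    theta = reinforce_step(theta)
```

Larger batch sizes reduce the variance of the gradient estimate, which is the mechanism behind the high-probability approximation guarantees summarized above.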
Supplementary Material: pdf
Submission Number: 12820