Abstract: Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization poses significant challenges to understanding the global convergence of these methods. For
a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop
a framework that provides a set of easily verifiable assumptions ensuring the Kurdyka-Łojasiewicz (KL)
condition for the policy optimization problem. Leveraging the KL condition, we show that policy gradient methods converge to the
globally optimal policy at a non-asymptotic rate despite nonconvexity. Our results find applications
in various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic
Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which
we show that stochastic policy gradient methods can obtain an $\epsilon$-optimal policy with a sample size that is $O(\epsilon^{-1})$ and polynomial in the
planning horizon. Our results establish the first sample complexity bounds in the literature
for multi-period inventory systems with Markov-modulated demands and for stochastic cash balance problems.