Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A study of last-iterate convergence guarantees to the optimal deterministic policy for policy gradient methods that dynamically adjust the stochasticity of the (hyper)policy.
Abstract: *Policy gradient* (PG) methods are effective *reinforcement learning* (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of order $\widetilde{\mathcal{O}}(\epsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\epsilon^{-3})$, but to the optimal stochastic (hyper)policy only, requiring stronger assumptions compared to PES.
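To make the two schemes concrete, below is a minimal sketch (not the authors' implementation) of a PES-style outer loop, which shrinks the exploration level on a deterministic schedule while running a fixed-stochasticity PG subroutine in each phase, and of an SL-PG-style joint update, which learns the stochasticity through an exponential parameterization. The callables `pg_subroutine`, `grad_theta`, and `grad_rho`, as well as the geometric decay factor, are placeholders assumed for illustration.

```python
# Sketch of the two schemes described in the abstract; all names and the
# decay schedule are illustrative assumptions, not the paper's exact algorithm.
import numpy as np

def pes(theta0, sigma0, num_phases, phase_iters, step_size, pg_subroutine, decay=0.5):
    """PES-style loop: hold stochasticity sigma fixed within each phase,
    run a PG subroutine, then reduce sigma via a deterministic schedule."""
    theta, sigma = np.asarray(theta0, dtype=float), float(sigma0)
    for _ in range(num_phases):
        # PG with fixed stochasticity, warm-started at the current parameters.
        theta = pg_subroutine(theta, sigma, phase_iters, step_size)
        # Deterministic schedule: shrink the exploration level for the next phase.
        sigma *= decay
    return theta  # last iterate, driven toward the optimal deterministic policy

def sl_pg_step(theta, rho, step_size, grad_theta, grad_rho):
    """SL-PG-style joint update: sigma = exp(rho) keeps the stochasticity
    positive while theta and rho are updated with the same gradient step."""
    sigma = np.exp(rho)
    g_theta = grad_theta(theta, sigma)  # stochastic gradient w.r.t. policy parameters
    g_rho = grad_rho(theta, sigma)      # stochastic gradient w.r.t. the stochasticity parameter
    return theta + step_size * g_theta, rho + step_size * g_rho
```

The exponential parameterization in the SL-PG sketch is one common way to keep the stochasticity strictly positive during joint learning; the geometric decay in the PES sketch stands in for whatever deterministic schedule the analysis prescribes.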
Lay Summary: Reinforcement Learning (RL) is a subfield of machine learning in which agents learn through interaction with an environment to determine the optimal behavior in sequential decision-making problems. Among the various families of RL methods, policy gradient (PG) approaches have demonstrated notable success in tackling continuous control tasks. These methods directly learn the parameters of stochastic (hyper)policies by exploring either at the action level or at the parameter level, with the amount of exploration governed by a chosen stochasticity level. While theoretical convergence guarantees for PG methods typically assume a fixed level of exploration, practitioners often adjust it dynamically during training. In this work, we bridge this gap between theory and practice by providing convergence guarantees for PG methods under a dynamically changing level of exploration, thus offering a theoretical foundation for a common empirical practice.
Link To Code: https://github.com/MontenegroAlessandro/MagicRL
Primary Area: Reinforcement Learning->Policy Search
Keywords: reinforcement learning, policy gradients, convergence, deterministic policies
Submission Number: 13207