TL;DR: We develop a theoretical framework for PMD in the agnostic, non-complete policy class setting, and prove upper bounds on the rate of convergence with respect to the best-in-class policy.
Abstract: Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results.
However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments.
In this work, we develop a theoretical framework for PMD with general policy classes, in which we replace the closure conditions with a generally weaker variational gradient dominance assumption, and we obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in a non-Euclidean space.
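As a point of reference, in the tabular case the generic PMD template with a KL (negative-entropy) mirror map reduces to a multiplicative-weights update on each state's action distribution. The minimal sketch below illustrates that standard update; the function name `pmd_softmax_update` and the toy numbers are ours, and this is the textbook tabular instance rather than the paper's function-approximation algorithm.

```python
import numpy as np

def pmd_softmax_update(pi, Q, eta):
    """One PMD step with a KL mirror map (standard tabular instance, not the paper's method).

    With the KL Bregman divergence, the proximal step has the closed form
        pi_{t+1}(a|s)  proportional to  pi_t(a|s) * exp(eta * Q(s, a)).

    pi  : (num_states, num_actions) array of action probabilities per state.
    Q   : (num_states, num_actions) array of action values under pi.
    eta : step size.
    """
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # subtract per-state max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Hypothetical usage on a toy problem with 2 states and 3 actions.
pi = np.full((2, 3), 1.0 / 3.0)
Q = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.2, 0.1]])
print(pmd_softmax_update(pi, Q, eta=0.5))
```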
Lay Summary: Modern policy optimization methods for reinforcement learning (such as the popular proximal policy optimization algorithm) roughly follow an algorithmic template called policy mirror descent (PMD). An abundance of theoretical works establish convergence of PMD either (i) in the tabular setting, where the state space is small, or (ii) in the function approximation setting (where the policies are represented by, e.g., neural networks), but only under strong assumptions on the policy class called closure conditions.
Unfortunately, closure conditions are generally deemed too strong to hold in practice, since, roughly speaking, they require the policy class to be "essentially complete", that is, to contain all possible policies. Modern large-scale environments are many orders of magnitude too large for realistically sized neural networks to represent all possible policies.
Our work establishes convergence of PMD in the function approximation setup, while replacing closure conditions with a variational gradient dominance condition, which is generally weaker. In particular, our assumption accommodates agnostic, non-realizable settings, while closure conditions do not, and appears to be a more plausible assumption to adopt in the context of real-world problems.
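For a concrete picture, one common way a variational gradient dominance condition is stated in the policy-gradient literature takes the following shape (notation ours; the paper's exact formulation may differ):

$$
V^{\star} \;-\; V(\pi) \;\le\; c \cdot \max_{\pi' \in \Pi} \big\langle \nabla V(\pi),\, \pi' - \pi \big\rangle \;+\; \varepsilon_{\Pi},
$$

where $c > 0$ is a problem-dependent constant and $\varepsilon_{\Pi} \ge 0$ is an additive error reflecting the expressiveness of the policy class $\Pi$; the agnostic, non-realizable regime corresponds to $\varepsilon_{\Pi} > 0$.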
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement Learning Theory, Function Approximation, Optimization
Submission Number: 10654