A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Order-optimal global convergence rate for an actor-critic approach in average reward RL with general policy parametrization
Abstract: This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and a strong dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length) using an Actor-Critic approach. Moreover, the convergence rate does not scale with the size of the state space, making the result applicable even to infinite state spaces.
Lay Summary: This work focuses on training decision-making systems that aim to maximize long-term rewards without any discounting, a problem known as average-reward reinforcement learning. Existing theoretical results are either suboptimal or struggle to scale when faced with large state and action spaces. To address these limitations, we introduce a new method called MLMC-NAC, short for Multi-level Monte Carlo-based Natural Actor-Critic. The use of Multi-level Monte Carlo (MLMC) helps to efficiently reduce the bias from Markovian sampling.
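To illustrate the MLMC idea mentioned above, here is a minimal sketch of a generic MLMC estimator for Markovian data, in the spirit of reducing sampling bias by averaging over trajectories of randomized, geometrically distributed length. This is not the paper's MLMC-NAC algorithm or interface; the function names (`mlmc_estimate`, `collect`) and parameters are illustrative assumptions.

```python
import numpy as np

def mlmc_estimate(collect, max_level=10, rng=None):
    """Sketch of a Multi-level Monte Carlo (MLMC) estimator for a quantity
    averaged over Markovian samples (e.g. a gradient or TD-error term).

    `collect(n)` is assumed to return an array of n consecutive samples
    from one trajectory; this interface is hypothetical, not the paper's.
    """
    rng = rng or np.random.default_rng()
    # Draw a random level J with P(J = j) = 2^{-j}, truncated at max_level.
    J = min(int(rng.geometric(0.5)), max_level)
    samples = np.asarray(collect(2 ** J))      # one rollout of 2^J samples
    g = samples[0]                             # cheap level-0 estimate
    if J >= 1:
        g_fine = samples.mean(axis=0)                    # avg of 2^J samples
        g_coarse = samples[: 2 ** (J - 1)].mean(axis=0)  # avg of first 2^(J-1)
        # Reweight the correction by 2^J = 1 / P(J = j); in expectation the
        # estimator then matches the long-run (length 2^max_level) average,
        # shrinking the bias from short Markovian rollouts at low expected cost.
        g = g + (2 ** J) * (g_fine - g_coarse)
    return g
```

The design point is that the expected number of samples per estimate grows only logarithmically with the effective trajectory length, while the bias decays as if the full long trajectory had been used.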
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Actor-Critic, Multi-level Monte Carlo, Natural Policy Gradient, General Policy Parametrization
Submission Number: 11381