Keywords: MDP, RL, Policy Gradient, Convergence
TL;DR: We improve the sample complexity for global convergence of single-timescale actor-critic to $O(\epsilon^{-3})$ from the $O(\epsilon^{-4})$ rate implied by existing results.
Abstract: We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling the complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close \textbf{globally optimal} policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ for reaching an $\epsilon$-close \textbf{stationary} policy, which translates, via the gradient domination lemma, into a complexity of $O(\epsilon^{-4})$ for reaching an $\epsilon$-close \textbf{globally optimal} policy. Furthermore, we demonstrate that achieving this improvement requires the step sizes of both the actor and the critic to decay as $O(k^{-\frac{2}{3}})$ with iteration $k$, diverging from the conventional $O(k^{-\frac{1}{2}})$ rates commonly used in (non)convex optimization.
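The sketch below is a minimal, illustrative single-timescale actor-critic loop that only mirrors the two ingredients named in the abstract: the actor and critic are updated at the same timescale, and both step sizes decay as $O(k^{-2/3})$. The toy random MDP, the tabular critic, the TD(0)/softmax update forms, and all constants are assumptions for illustration and are not the authors' exact algorithm or analysis setting.

```python
# Minimal sketch of a single-timescale tabular actor-critic on a toy random MDP.
# Only the single-timescale structure and the O(k^{-2/3}) step-size decay reflect
# the abstract; everything else (MDP, tabular critic, update forms, constants)
# is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random finite MDP: transition kernel P[s, a] (distribution over next states)
# and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
V = np.zeros(n_states)                   # critic: tabular value estimates

def policy(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

s = rng.integers(n_states)
for k in range(1, 50_000):
    # Single timescale: actor and critic share the same O(k^{-2/3}) decay,
    # rather than the conventional O(k^{-1/2}) schedule.
    alpha = 0.5 * k ** (-2.0 / 3.0)  # actor step size
    beta = 0.5 * k ** (-2.0 / 3.0)   # critic step size

    p = policy(s)
    a = rng.choice(n_actions, p=p)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Critic: TD(0) update of the value estimate.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += beta * delta

    # Actor: policy-gradient step using the TD error as the advantage estimate,
    # with grad_log = e_a - pi(.|s) for the softmax parameterization.
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha * delta * grad_log

    s = s_next

print("greedy actions per state:", theta.argmax(axis=1))
```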
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Navdeep_Kumar1
Track: Regular Track: unpublished work
Submission Number: 57