TL;DR: Actor-critic algorithms with general function approximation can achieve $\sqrt{T}$ regret and $1/\epsilon^2$ sample complexity without assuming reachability or coverage.
Abstract: Actor-critic algorithms have become a cornerstone of reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories under general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/ \epsilon^2)$ trajectories, along with an accompanying $\sqrt{T}$ regret bound whenever the Bellman eluder dimension $d$ grows with $T$ at no more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class and $\mathcal{A}$ is the action space. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this approach to the hybrid RL setting, where we show that initializing the critic with offline data yields sample-efficiency gains, and we also provide a \textit{non-optimistic} provably efficient actor-critic algorithm, addressing another open problem in the literature. Numerical experiments support our theoretical findings.
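The abstract names three algorithmic ingredients: optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. The minimal Python sketch below is only an illustration of how such ingredients can fit together in a single training loop; the tabular environment, the count-based bonus with scale `beta`, the averaging critic, the mirror-ascent actor step with step size `eta`, and the count-doubling reset rule are all assumptions made for exposition, not the paper's actual algorithm or analysis (see the linked repository for the authors' code).

```python
# Schematic sketch (not the paper's algorithm): an optimistic actor-critic loop
# with off-policy critic regression toward Q*, a bonus for optimism, and
# rare-switching policy resets triggered by a count-doubling test.
# The tabular MDP, bonus form, and reset rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, H, T = 5, 3, 4, 200          # states, actions, horizon, episodes
beta, eta = 0.5, 0.1               # bonus scale and actor step size (assumed)

# Random tabular MDP used purely as a stand-in environment.
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] -> next-state dist
R = rng.uniform(size=(H, S, A))                 # rewards in [0, 1]

logits = np.zeros((H, S, A))                     # softmax actor parameters
data = [[] for _ in range(H)]                    # replay buffer per step
counts = np.ones((H, S, A))                      # visit counts (for the bonus)
last_reset_counts = counts.copy()

def policy(h, s):
    z = np.exp(logits[h, s] - logits[h, s].max())
    return z / z.sum()

for t in range(T):
    # --- Collect one trajectory with the current actor. ---
    s = rng.integers(S)
    for h in range(H):
        a = rng.choice(A, p=policy(h, s))
        s_next = rng.choice(S, p=P[h, s, a])
        data[h].append((s, a, R[h, s, a], s_next))
        counts[h, s, a] += 1
        s = s_next

    # --- Off-policy critic: approximate dynamic programming over the whole
    #     buffer, regressing toward an optimistic (bonus-augmented) target
    #     that uses the max over next actions, i.e. it targets Q*. ---
    Q = np.zeros((H + 1, S, A))
    for h in reversed(range(H)):
        num = np.zeros((S, A))
        den = np.zeros((S, A))
        for (s_, a_, r_, sn_) in data[h]:
            num[s_, a_] += r_ + Q[h + 1, sn_].max()
            den[s_, a_] += 1.0
        Q[h] = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        Q[h] = np.clip(Q[h] + beta / np.sqrt(counts[h]), 0.0, H)  # optimism

    # --- Actor: soft policy improvement (mirror-ascent-style update). ---
    logits += eta * Q[:H]

    # --- Rare-switching reset: restart the actor only when the data has
    #     grown enough (here: some visit count doubled since the last reset). ---
    if np.any(counts >= 2 * last_reset_counts):
        logits[:] = 0.0
        last_reset_counts = counts.copy()
```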
Lay Summary: Reinforcement learning is a type of machine learning where an agent learns by trying different actions and getting feedback, much like how people learn through trial and error. One popular paradigm within it combines two parts: one that decides what to do (the actor) and one that evaluates how good those decisions are (the critic). However, current actor-critic methods can be slow and inefficient, especially when the agent must explore and try new things guided by inexact critic estimates, while the critic must continually evaluate an ever-changing actor. It has been an open question whether one can devise an actor-critic algorithm that converges at an optimal rate when the critic can be arbitrarily parameterized -- with deep neural nets, linear regressions, random forests, or some other class of machine learning algorithms.
We provide a method that does so by exploring strategically, using past experience more effectively, and occasionally restarting the decision-making process to avoid getting stuck. We also show that using previously collected data, rather than learning only from scratch, can speed things up even more. Our approach not only improves learning efficiency but also answers a long-standing question in the field about whether such systems can be sample efficient.
Link To Code: https://github.com/hetankevin/hybridcov
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: actor-critic, policy gradient, strategic exploration, optimism, reinforcement learning, sample efficiency
Submission Number: 12702