PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure

TMLR Paper 6152 Authors

09 Oct 2025 (modified: 17 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Fixed-confidence (PAC) guarantees are the right primitive when data are scarce or failures are costly. This survey organizes the 2018--2025 literature through a Coverage--Structure--Objective (CSO) template, in which sample complexity satisfies $N(\varepsilon,\delta)\approx \mathsf{Cov}\times \mathsf{Comp}\times \mathrm{poly}(H)\times \varepsilon^{-2}$. Coverage captures access assumptions (online/generative vs. offline via concentrability); Structure captures problem-dependent capacity (tabular $SA$, linear dimension $d$, effective dimension $d_{\mathrm{eff}}(\lambda)$, rank $r$, Bellman/witness/BE measures); Objective fixes the target (uniform-PAC/regret, instance-dependent identification, reward-free exploration, offline control/OPE, partial observability). We synthesize: tight tabular baselines; the uniform-PAC $\Rightarrow$ high-probability regret bridge; structured learnability under Bellman rank and Bellman--Eluder dimension; linear, kernel/NTK, and low-rank models; reward-free exploration as coverage creation; and pessimistic offline RL with explicit coverage dependence. Practical outputs include a rate "cookbook," a decision tree, and a unified roadmap of open problems (kernel/NTK uniform-PAC, agnostic low-rank, misspecified offline RL, instance-dependent FA, structure selection). We unify notation, state results with explicit dependencies, and provide a decision toolkit for practitioners.
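To make the CSO template concrete, the following minimal Python sketch (not part of the submission) composes the coverage, structure, and horizon factors into a back-of-the-envelope estimate of $N(\varepsilon,\delta)$. The cubic horizon power, the $\log(1/\delta)$ factor, and the tabular placeholder values (S=20, A=5) are illustrative assumptions rather than quantities fixed by the abstract.

from math import log

def cso_sample_complexity(cov: float, comp: float, horizon: int,
                          eps: float, delta: float) -> float:
    """Rough N(eps, delta) under the CSO template:
    N ~ Cov * Comp * poly(H) * eps^{-2} * log(1/delta).
    The horizon exponent (3 here) and the log(1/delta) factor are
    illustrative placeholders; exact powers depend on the setting."""
    return cov * comp * horizon**3 * eps**-2 * log(1.0 / delta)

# Hypothetical example: tabular online RL, Cov ~ 1, Comp ~ S*A with S=20, A=5.
print(cso_sample_complexity(cov=1.0, comp=20 * 5, horizon=10,
                            eps=0.1, delta=0.05))

The point of the sketch is only the multiplicative composition the template asserts: access (Cov), capacity (Comp), and horizon each scale the $\varepsilon^{-2}$ fixed-confidence rate independently.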
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Anonymized and formatted correctly.
Assigned Action Editor: ~Aleksandra_Faust1
Submission Number: 6152