PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure

TMLR Paper6152 Authors

09 Oct 2025 (modified: 21 Dec 2025) | Under review for TMLR | CC BY 4.0
Abstract: Fixed-confidence (PAC) guarantees are the right primitive when data are scarce or failures are costly. This survey organizes the 2018-2025 literature through a Coverage-Structure-Objective (CSO) template, in which sample complexity satisfies $N(\varepsilon,\delta) \approx \mathsf{Cov}\times \mathsf{Comp}\times \mathrm{poly}(H)\times \varepsilon^{-2}$. Coverage captures access assumptions (online/generative vs. offline via concentrability); Structure captures problem-dependent capacity (tabular $SA$, linear dimension $d$, effective dimension $d_{\mathrm{eff}}(\lambda)$, rank $r$, Bellman/witness/BE measures); Objective fixes the target (uniform-PAC/regret, instance-dependent identification, reward-free exploration, offline control/OPE, partial observability). We synthesize: tight tabular baselines; the uniform-PAC $\Rightarrow$ high-probability regret bridge; structured learnability under Bellman rank and Bellman-Eluder dimension; linear, kernel/NTK, and low-rank models; reward-free exploration as coverage creation; and pessimistic offline RL with explicit coverage dependence. Practical outputs include a rate "cookbook," a decision tree, and a unified roadmap of open problems (kernel/NTK uniform-PAC, agnostic low-rank, misspecified offline RL, instance-dependent FA, structure selection). We unify notation, state results with explicit dependencies, and provide a decision toolkit for practitioners.
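To make the CSO template concrete, the following is a minimal back-of-envelope sketch, not anything taken from the paper. The function name `cso_sample_estimate`, its parameters, and the example values are all hypothetical. The sketch drops constants and logarithmic factors (including any dependence on $\delta$), and takes the horizon exponent as an explicit argument because the revision notes report exponents ranging from H^3 to H^6 depending on the setting.

```python
def cso_sample_estimate(cov, comp, horizon, horizon_exponent, epsilon):
    """Order-of-magnitude estimate N ~ Cov * Comp * H^p / eps^2.

    Hypothetical helper: constants and log factors are dropped, so the
    result is only a scaling guide, not a guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return cov * comp * horizon ** horizon_exponent / epsilon ** 2


# Assumed toy tabular instance: S = 5 states, A = 4 actions, so Comp = S * A = 20.
# Cov = 1 is an assumption standing in for online/generative access.
n = cso_sample_estimate(cov=1.0, comp=5 * 4, horizon=10,
                        horizon_exponent=3, epsilon=0.1)
print(f"approximate sample count: {n:.2e}")
```

Under these assumed values the estimate is roughly two million samples, and halving $\varepsilon$ quadruples it, which reflects the $\varepsilon^{-2}$ factor in the template.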
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Revision R1 (addressing Reviews LL6h, Bnxb, and 6bQW)

1. Bound Tightness and Horizon Exponents (LL6h-RC1, LL6h-RC2). Added Table 3, "Horizon Exponents Across Settings," with explicit H exponents (H^3 to H^6) and tightness annotations (tight/upper/lower) for all major settings. Added tightness annotations after each major theorem (Theorems 3, 4, 5, 6, 7, 8) explicitly stating whether bounds are tight, upper-only, or lower-only, with citations. Added an explanatory paragraph, "Why horizon exponents differ," explaining the root cause (variance accumulation vs. correlated Bellman backups). Replaced generic "poly(H, d)" expressions with exact exponents where known (e.g., "d^2 H^4" rather than "poly(d, H)").

2. Setting Connections (LL6h-RC3). Added a new subsection, "How Settings Relate," in Section 2 with: explicit parameter mappings showing how each structured setting recovers tabular (e.g., "Linear MDP with d=SA recovers tabular rates"); an inclusion hierarchy (Tabular subset of Linear subset of Low-rank subset of Bilinear subset of Finite d_BE); a "When do bounds match across settings?" discussion clarifying rate recovery and gaps; and practical guidance for analyzing new problems.

3. Practical Limitations (Bnxb-RC2, 6bQW). Added a framed "Assumptions and Scope" box immediately after the CSO template in Section 2. It explicitly warns that realizability, Bellman completeness, and coverage assumptions are often violated in deep RL practice, and includes a forward pointer to Section 11 (Practical Toolkit) for diagnostics.

4. Redundancy Reduction (Bnxb-RC1). Consolidated the uniform-PAC implies regret theorem into Section 3 (Preliminaries) as the canonical version. Replaced full theorem restatements in Section 1 (Introduction) and Section 4 (Tabular) with brief paragraphs and forward references (e.g., "By Theorem 3..."). Reduced main content by approximately 1 page.

5. Accessibility Improvements (6bQW). Added a "Reinforcement learning in brief" primer paragraph at the start of Section 1, defining state, action, reward, policy, and the exploration-exploitation tradeoff in accessible terms before the technical definitions. The primer explains what fixed-confidence (PAC) guarantees mean and why they matter.

6. Formatting Fixes (6bQW). Fixed table overflow in Tables 2 and 4 using resizebox. Standardized section numbering (removed an inconsistent subsection in Section 2). Fixed a malformed paragraph header in Section 1. Fixed figure file references.

Summary of Changes: Net content change is approximately -0.5 pages (redundancy removed; primer and tables added). All reviewer-requested changes have been implemented. No changes to technical claims or theorems beyond adding tightness annotations.
Assigned Action Editor: ~Aleksandra_Faust1
Submission Number: 6152