PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, PAC-Bayes
TL;DR: Traditional generalization bounds assume independent data, but RL trajectories are sequential and dependent, making classical bounds inapplicable or vacuous for reinforcement learning.
Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning (RL) that explicitly accounts for Markov dependencies in the data through the chain's mixing time. This is a step toward generalization guarantees for RL, where the sequential nature of the data violates the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, an algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
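To make the abstract's central idea concrete, the sketch below shows an illustrative McAllester-style PAC-Bayes bound in which the raw sample count is discounted by the chain's mixing time, yielding an effective number of near-independent samples. The functional form, the variable names (`t_mix`, `n_eff`), and the example numbers are all illustrative assumptions; the paper's actual bound is not reproduced here.

```python
import math

def pac_bayes_bound(empirical_risk, kl, n, t_mix, delta=0.05):
    """Illustrative McAllester-style PAC-Bayes bound.

    The sample size n is discounted by the mixing time t_mix to account
    for Markov dependencies (hypothetical form, not the paper's bound).
    empirical_risk: average loss of the posterior policy on observed data.
    kl: KL divergence between posterior and prior over policies.
    delta: failure probability of the certificate.
    """
    n_eff = n / t_mix  # effective number of near-independent samples
    slack = math.sqrt((kl + math.log(2 * math.sqrt(n_eff) / delta)) / (2 * n_eff))
    return empirical_risk + slack

# Example: 100k transitions with mixing time 50 -> 2k effective samples.
bound = pac_bayes_bound(empirical_risk=0.2, kl=10.0, n=100_000, t_mix=50)
```

A slowly mixing chain (large `t_mix`) shrinks `n_eff` and loosens the certificate, which is the intuition behind tracking the mixing time in the bound; an algorithm like PB-SAC could then minimize such a bound as an auxiliary training objective.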
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19914