PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, PAC-Bayes
TL;DR: Traditional generalization bounds assume independent data, but RL trajectories are sequential and dependent, making classical bounds inapplicable or vacuous for reinforcement learning.
Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning (RL) that explicitly accounts for Markov dependencies in the data through the chain's mixing time. This is a step toward generalization guarantees for RL, where the sequential nature of the data violates the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, an algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
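To make the abstract's central idea concrete, the sketch below shows an illustrative McAllester-style PAC-Bayes bound in which the raw sample count is discounted by the chain's mixing time, yielding an effective number of near-independent samples. The functional form, the variable names (`t_mix`, `n_eff`), and the example numbers are all illustrative assumptions; the paper's actual bound is not reproduced here.

```python
import math

def pac_bayes_bound(empirical_risk, kl, n, t_mix, delta=0.05):
    """Illustrative McAllester-style PAC-Bayes bound.

    The sample size n is discounted by the mixing time t_mix to account
    for Markov dependencies (hypothetical form, not the paper's bound).
    empirical_risk: average loss of the posterior policy on observed data.
    kl: KL divergence between posterior and prior over policies.
    delta: failure probability of the certificate.
    """
    n_eff = n / t_mix  # effective number of near-independent samples
    slack = math.sqrt((kl + math.log(2 * math.sqrt(n_eff) / delta)) / (2 * n_eff))
    return empirical_risk + slack

# Example: 100k transitions with mixing time 50 -> 2k effective samples.
bound = pac_bayes_bound(empirical_risk=0.2, kl=10.0, n=100_000, t_mix=50)
```

A slowly mixing chain (large `t_mix`) shrinks `n_eff` and loosens the certificate, which is the intuition behind tracking the mixing time in the bound; an algorithm like PB-SAC could then minimize such a bound as an auxiliary training objective.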
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19914