Keywords: Reinforcement Learning, PAC-Bayes
TL;DR: Traditional generalization bounds assume independent data, but RL trajectories are sequential and dependent, rendering classical bounds inapplicable or vacuous.
Abstract: We derive a novel PAC-Bayesian generalization bound for reinforcement learning (RL) that explicitly accounts for Markov dependencies in the data through the chain’s mixing time. This is a step toward generalization guarantees for RL, where the sequential nature of the data violates the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the bound’s practical utility through PB-SAC, an algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
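To make the abstract's claim concrete, the sketch below shows a standard McAllester-style PAC-Bayes bound and how a mixing-time correction of the kind described would typically enter, via an effective sample size. This is an illustrative sketch under assumed notation ($L$, $\hat{L}_n$, prior $P$, posterior $Q$, confidence $\delta$, mixing time $\tau_{\mathrm{mix}}$), not the paper's exact statement.

```latex
% Classical PAC-Bayes bound for i.i.d. data: with probability at least 1 - \delta,
% for all posteriors Q over hypotheses,
\mathbb{E}_{h \sim Q}\bigl[L(h)\bigr]
\;\le\;
\mathbb{E}_{h \sim Q}\bigl[\hat{L}_n(h)\bigr]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}

% A mixing-time-aware variant (as the abstract suggests) would typically replace the
% sample size n with an effective sample size n_eff that shrinks as dependence grows:
n_{\mathrm{eff}} \;\approx\; \frac{n}{\tau_{\mathrm{mix}}}
```

Under this reading, slower-mixing chains yield a smaller effective sample size and hence a looser certificate, which is consistent with the abstract's emphasis on the mixing time as the quantity controlling the dependence penalty.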
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19914