Keywords: Preference Optimization, Reinforcement Learning, Logarithmic Regret
Abstract: The remarkable sample efficiency of preference-based reinforcement learning, which underpins the alignment of large language models with human feedback (RLHF), presents a significant theoretical puzzle. Existing analyses often rely on idealized assumptions, such as infinite-particle ensembles or exact, full-batch gradients, that are disconnected from the practical realities of deployed algorithms. This paper closes this theory-practice gap. We introduce a unified optimistic PAC-Bayesian framework that distills the statistical essence of complex, multi-stage RLHF pipelines into a single, provably efficient online learning algorithm. Our central result is a high-probability regret bound of $\widetilde{\mathcal{O}}(d_{\mathrm{eluder}} \log T)$ for a rich, non-linear class of reward models, demonstrating that logarithmic regret is achievable even when using finite ensembles and noisy stochastic gradient updates. This unified theory provides an explanation for the sample efficiency of pairwise preference optimization, extends naturally to full Markov Decision Processes, and establishes a theoretical foundation for the empirical success of methods like RLHF.
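For context, the regret in the stated bound presumably refers to the standard cumulative notion over $T$ rounds of interaction; a generic sketch of such a quantity (the paper's precise definition is not given in the abstract, and the symbols $r^\star$, $x_t$, $a_t^\star$, $a_t$ below are illustrative assumptions) is
$$\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big( r^\star(x_t, a_t^\star) - r^\star(x_t, a_t) \Big),$$
where $r^\star$ denotes a ground-truth reward model, $a_t^\star$ an optimal response to prompt $x_t$, and $a_t$ the response chosen at round $t$. The claimed $\widetilde{\mathcal{O}}(d_{\mathrm{eluder}} \log T)$ bound would then mean this sum grows only logarithmically in $T$, scaled by the eluder dimension of the reward class.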
Primary Area: reinforcement learning
Submission Number: 17556