Keywords: Preference Optimization, Reinforcement Learning, Logarithmic Regret
Abstract: The remarkable sample efficiency of preference-based reinforcement learning, which underpins the alignment of large language models with human feedback (RLHF), presents a significant theoretical puzzle. Existing analyses often rely on idealized assumptions, such as infinite-particle ensembles or exact, full-batch gradients, that are disconnected from the practical realities of deployed algorithms. This paper closes this theory-practice gap. We introduce a unified optimistic PAC-Bayesian framework that distills the statistical essence of complex, multi-stage RLHF pipelines into a single, provably efficient online learning algorithm. Our central result is a high-probability regret bound of $\widetilde{\mathcal{O}}(d_{\mathrm{eluder}} \log T)$ for a rich, non-linear class of reward models, demonstrating that logarithmic regret is achievable even when using finite ensembles and noisy stochastic gradient updates. This unified theory provides an explanation for the sample efficiency of pairwise preference optimization, extends naturally to full Markov Decision Processes, and establishes a theoretical foundation for the empirical success of methods like RLHF.
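For context, the regret in the stated bound presumably refers to the standard cumulative notion over $T$ rounds of interaction; a generic sketch of such a quantity (the paper's precise definition is not given in the abstract, and the symbols $r^\star$, $x_t$, $a_t^\star$, $a_t$ below are illustrative assumptions) is
$$\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \Big( r^\star(x_t, a_t^\star) - r^\star(x_t, a_t) \Big),$$
where $r^\star$ denotes a ground-truth reward model, $a_t^\star$ an optimal response to prompt $x_t$, and $a_t$ the response chosen at round $t$. The claimed $\widetilde{\mathcal{O}}(d_{\mathrm{eluder}} \log T)$ bound would then mean this sum grows only logarithmically in $T$, scaled by the eluder dimension of the reward class.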
Primary Area: reinforcement learning
Submission Number: 17556