Logarithmic Regret in Preference Learning via Optimistic PAC-Bayesian Particle Ensembles

19 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Preference Optimization, Reinforcement Learning, Logarithmic Regret
Abstract: The remarkable sample efficiency of preference-based reinforcement learning, which underpins the alignment of large language models with human feedback (RLHF), presents a significant theoretical puzzle. Existing analyses often rely on idealized assumptions, such as infinite-particle ensembles or exact, full-batch gradients, that are disconnected from the practical realities of deployed algorithms. This paper provides a statistically grounded abstraction of modern RLHF-style training pipelines: a unified optimistic PAC-Bayesian framework that distills the statistical essence of these complex, multi-stage pipelines into a single, provably efficient online learning algorithm. Our central result is a high-probability regret bound of $\widetilde{\mathcal{O}}(d_{\mathrm{eluder}}\log T)$ for a rich, non-linear class of reward models, demonstrating when and why logarithmic regret is achievable with finite ensembles and noisy stochastic gradient updates under preference feedback. This theory explains the sample efficiency of pairwise preference optimization, extends naturally to full Markov Decision Processes, and establishes a theoretical foundation for the empirical success of methods like RLHF.
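To make the abstract's ingredients concrete, the following is a minimal sketch (not the paper's algorithm) of an optimistic finite-particle ensemble trained with noisy stochastic gradient updates on pairwise preference feedback, assuming a Bradley-Terry preference model and a simple linear reward class. All names and hyperparameters here (ParticleEnsemble, n_particles, lr, noise_scale) are illustrative assumptions, not taken from the submission.

```python
# Sketch: optimistic particle ensemble for pairwise preference learning.
# Assumes a linear reward class and Bradley-Terry feedback; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

class ParticleEnsemble:
    """K reward-model 'particles' updated with noisy SGD on preference pairs."""
    def __init__(self, dim, n_particles=8, lr=0.1, noise_scale=0.01):
        self.particles = rng.normal(size=(n_particles, dim))  # one weight vector per particle
        self.lr = lr
        self.noise_scale = noise_scale

    def optimistic_score(self, x):
        # Optimism: score a candidate by its most favorable particle.
        return (self.particles @ x).max()

    def update(self, x_win, x_lose):
        # One noisy SGD step per particle on the Bradley-Terry (logistic) loss
        # -log sigmoid(w . (x_win - x_lose)).
        diff = x_win - x_lose
        for k in range(len(self.particles)):
            p_wrong = 1.0 / (1.0 + np.exp(self.particles[k] @ diff))  # 1 - sigmoid(w . diff)
            grad = -p_wrong * diff
            noise = self.noise_scale * rng.normal(size=diff.shape)
            self.particles[k] -= self.lr * (grad + noise)

# Toy online loop: pick between two candidates by optimistic score, then
# receive a simulated Bradley-Terry preference from a hidden true reward.
dim, T = 5, 200
true_w = rng.normal(size=dim)
ensemble = ParticleEnsemble(dim)
for t in range(T):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, other = (a, b) if ensemble.optimistic_score(a) >= ensemble.optimistic_score(b) else (b, a)
    p_win = 1.0 / (1.0 + np.exp(-(true_w @ (chosen - other))))  # preference probability
    if rng.random() < p_win:
        ensemble.update(chosen, other)
    else:
        ensemble.update(other, chosen)
```

The ensemble maximum plays the role of the optimistic estimate in the abstract, while the per-particle noisy gradient step stands in for the stochastic, finite-ensemble updates whose regret the paper analyzes.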
Primary Area: reinforcement learning
Submission Number: 17556