Dueling in the Dark: An Efficient and Optimal $O(\sqrt{T})$ Mirror Descent Approach for Competing against Adversarial Preferences
Keywords: Large Language Models (LLMs), Reinforcement Learning from Human Feedback (RLHF), gradient descent-based algorithm, theoretical foundations, active no-regret learning, preference feedback, trajectory preferences, multi-way feedback, human-AI alignment, practical impact.
TL;DR: This paper introduces a gradient descent-based algorithm with no-regret guarantees for adversarial dueling bandits, with implications for the theoretical understanding of RLHF.
Abstract: Recent developments in Large Language Models (LLMs) have sparked significant attention in Reinforcement Learning from Human Feedback (RLHF), which uses reinforcement learning techniques to optimize a model's performance through human-provided feedback. A simple, widely used, and cost-effective way to gather human feedback is through relative (preference) queries, often modeled with sigmoid utility models. Despite the popularity of sigmoid model-based RLHF algorithms, their theoretical foundations remain underdeveloped: existing algorithms often lack performance guarantees or are limited to small-scale problems due to computationally intractable steps. We address the challenge of developing no-regret learning algorithms for training optimal RLHF policies and develop the first efficient gradient descent-based algorithm with near-optimal regret guarantees. More technically, we consider the adversarial online convex optimization problem with preference feedback and propose a mirror descent method that obtains a regret of $O(\sqrt{T})$ over $T$ rounds. The main technical challenge lies in constructing a suitable gradient approximation of the underlying utility functions solely from binary preference feedback. Following this, we extend our results to policy optimization in the RLHF framework with trajectory preferences and design no-regret RL policies using a variant of mirror descent. We also extend our methods beyond pairwise preferences --- to multi-way (batched pairwise) feedback and ranking feedback --- and analyze the trade-off between learning rate and subset size. Our contribution lays the groundwork for a practical gradient descent-based algorithm in RLHF with human preferences. Backed by robust theoretical guarantees, our approach holds promise for developing efficient algorithms for LLMs and for addressing human-AI alignment challenges. Empirical evaluations validate our theoretical findings.
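To make the core idea concrete, below is a minimal, hedged sketch (not the authors' exact algorithm) of how a descent direction can be estimated from a single bit of preference feedback and plugged into a projected mirror-descent update with the Euclidean mirror map. The toy utility function, the sigmoid comparison oracle, and all hyperparameters (`delta`, `eta`, `radius`) are hypothetical placeholders chosen for illustration; the scaling of the gradient surrogate is likewise an assumption.

```python
# Sketch: comparison-based gradient surrogate + projected mirror descent (Euclidean case).
# All quantities below are illustrative assumptions, not the paper's exact construction.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
delta, eta, radius = 0.1, 0.05, 1.0          # perturbation size, step size, feasible-ball radius

u_star = np.ones(d) / np.sqrt(d)              # hidden utility peak used only by the toy oracle
utility = lambda x: -np.linalg.norm(x - u_star) ** 2

def prefer(x, y):
    """Toy sigmoid (Bradley-Terry style) comparison oracle: +1 if x wins, else -1."""
    p = 1.0 / (1.0 + np.exp(-(utility(x) - utility(y))))
    return 1 if rng.random() < p else -1

def project(x):
    """Project onto the Euclidean ball of the given radius (squared-norm mirror map)."""
    n = np.linalg.norm(x)
    return x if n <= radius else radius * x / n

x = np.zeros(d)
for t in range(T):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)                    # random unit direction
    o = prefer(x + delta * w, x - delta * w)  # single bit of preference feedback
    g_hat = -(d / delta) * o * w              # comparison-based gradient surrogate
    x = project(x - eta * g_hat)              # mirror-descent step (Euclidean special case)

print("final point:", np.round(x, 3), " utility:", round(utility(x), 4))
```

With the Euclidean mirror map this reduces to projected gradient descent; other mirror maps (e.g., negative entropy over the simplex) would change only the update and projection steps.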
Supplementary Material: pdf
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8957