Dueling in the Dark: An Efficient and Optimal Mirror Descent Approach for Online Optimization with Adversarial Preferences

Published: 10 Oct 2024, Last Modified: 07 Dec 2024, NeurIPS 2024 Workshop, CC BY 4.0
Keywords: Large Language Models (LLMs), Reinforcement Learning from Human Feedback (RLHF), gradient descent-based algorithm, theoretical foundations, active no-regret learning, preference feedback, trajectory preferences, multi-way feedback, human-AI alignment, practical impact.
TL;DR: This paper introduces a gradient descent-based algorithm with no-regret guarantees for adversarial dueling bandits, which has implications for the theoretical understanding of RLHF.
Abstract: Recent developments in Large Language Models (LLMs) have sparked significant attention in Reinforcement Learning from Human Feedback (RLHF). A simple, widely used, and cost-effective method for gathering human feedback is through relative queries based on human preferences, often modeled using sigmoid utility models. Despite the popularity of sigmoid preference-based RLHF algorithms, their theoretical foundations remain underdeveloped: existing algorithms often lack performance guarantees or are limited to small-scale problems due to computationally intractable steps. We address the challenge of developing no-regret learning algorithms for training an optimal policy in RLHF, and develop one of the first efficient gradient descent-based algorithms with near-optimal regret (as well as sample-complexity) guarantees. More technically, we consider the adversarial online convex (linear) optimization (OLO) problem in $d$ dimensions with preference feedback and propose an efficient mirror-descent-based approach with an optimal $\tilde O(d \sqrt T)$ regret bound over $T$ rounds. The main challenge lies in finding a suitable gradient approximation of the underlying utility functions solely from the weaker preference feedback, as opposed to the conventional gradient or value feedback used in OLO. We also extend our methods beyond pairwise preferences to multi-way ($B$-sized batched pairwise) preference feedback and show an improved learning rate of $\tilde O(\frac{d}{\sqrt{\min\{B,d\}}} \sqrt T)$, which establishes the right trade-off between learning rate and batch size. Our contribution lays the groundwork for a practical gradient descent-based algorithm in RLHF. Supported by robust theoretical guarantees, our approach holds promise in the current landscape of developing efficient algorithms for LLMs and addressing human-AI alignment challenges.
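To make the high-level recipe in the abstract concrete, the sketch below shows a generic mirror-descent loop driven only by one-bit pairwise preference feedback. It is an illustrative assumption, not the paper's algorithm: the single-point-style gradient surrogate, the Euclidean mirror map (reducing to projected gradient descent on the unit ball), and the `prefer` oracle are all standard placeholder choices.

```python
import numpy as np

def dueling_mirror_descent(T, d, prefer, delta=0.05, eta=0.01, rng=None):
    """Illustrative sketch (NOT the paper's exact method): online mirror
    descent where the gradient is approximated from a single pairwise
    preference (duel) per round.

    prefer(a, b) is an assumed black-box oracle returning +1 if the
    environment prefers point a over point b and -1 otherwise, e.g. a
    noisy sigmoid comparison of hidden utilities.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(d)                              # iterate kept in the unit ball
    for _ in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                   # random direction on the sphere
        # One duel between two symmetric perturbations of the current point.
        o = prefer(x + delta * u, x - delta * u)  # o in {+1, -1}
        # One-bit gradient surrogate built from the comparison outcome,
        # in the spirit of single-point bandit gradient estimates.
        g = -(d / delta) * o * u
        # Mirror descent step; with the Euclidean mirror map this is just
        # a gradient step followed by projection onto the unit ball.
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > 1.0:
            x /= norm
    return x

# Hypothetical usage: hidden linear utility with sigmoid (Bradley-Terry) preferences.
theta = np.ones(8) / np.sqrt(8)
pref_rng = np.random.default_rng(0)

def prefer(a, b):
    p = 1.0 / (1.0 + np.exp(-(theta @ (a - b))))  # P(a preferred over b)
    return 1 if pref_rng.random() < p else -1

x_hat = dueling_mirror_descent(T=5000, d=8, prefer=prefer)
```

The multi-way ($B$-sized batched pairwise) extension described in the abstract would, under the same assumptions, average $B$ such one-bit surrogates per round to reduce the estimator's variance, which is the intuition behind the improved $\tilde O(\frac{d}{\sqrt{\min\{B,d\}}} \sqrt T)$ rate.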
Submission Number: 99