Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose an algorithm for the problem of Bayesian optimization from preference-based (human) feedback and prove improved and tight regret bounds and sample complexities.
Abstract: Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from conventional BO by learning the best actions from a reduced feedback model, where only the preference between two actions is revealed to the learner at each time step. The objective is to identify the best action using a limited number of preference queries, typically obtained through costly human feedback. Existing work, which adopts the Bradley-Terry-Luce (BTL) feedback model, provides regret bounds for the performance of several algorithms. In this work, within the same framework, we develop tighter performance guarantees. Specifically, we derive regret bounds of $\tilde{\mathcal{O}}(\sqrt{\Gamma(T)T})$, where $\Gamma(T)$ represents the maximum information gain (a kernel-specific complexity term) and $T$ is the number of queries. Our results significantly improve upon existing bounds. Notably, for common kernels, we show that the order-optimal sample complexities of conventional BO, achieved with richer feedback models, are recovered. In other words, the same number of preferential samples as scalar-valued samples is sufficient to find a nearly optimal solution.
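To make the feedback model in the abstract concrete, the following minimal Python sketch simulates Bradley-Terry-Luce (BTL) comparisons under a latent utility function. It is not the paper's algorithm; the toy utility and the helper names `latent_utility` and `btl_duel` are assumptions introduced only for illustration.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's algorithm):
# simulate the BTL preference feedback used in the BOHF setting.

rng = np.random.default_rng(0)

def latent_utility(x):
    """Hypothetical smooth utility on [0, 1]; never observed by the learner."""
    return np.sin(3.0 * x) + 0.5 * x

def btl_duel(x1, x2):
    """BTL model: P(x1 preferred over x2) = sigmoid(f(x1) - f(x2))."""
    p = 1.0 / (1.0 + np.exp(-(latent_utility(x1) - latent_utility(x2))))
    return rng.random() < p  # True means x1 wins the comparison.

# The learner only observes binary duel outcomes, not utility values.
actions = np.linspace(0.0, 1.0, 101)
best = actions[np.argmax(latent_utility(actions))]
print("duel(0.2, 0.8) ->", btl_duel(0.2, 0.8), "| best action ~", round(best, 2))
```

Under this model each query returns a single bit, and the regret bound $\tilde{\mathcal{O}}(\sqrt{\Gamma(T)T})$ quantifies how quickly the queried actions approach the best action over $T$ such comparisons.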
Lay Summary: In many decision-making tasks—like tuning a chatbot, designing a product, or choosing ad content—it’s unrealistic to ask people to assign precise scores to every option. But people are usually much better at comparing two options and saying which one they prefer. This kind of preference feedback is often more reliable, even if it’s less detailed. Still, collecting it can be costly and time-consuming. That’s why it’s important to design smart algorithms that can figure out the best choice using as few of these comparisons as possible. Our work focuses on building such efficient algorithms. The goal is to learn the best possible option by carefully choosing which pairs of options to compare, making the most of each piece of feedback. What we found is both surprising and exciting: even though preference feedback is less detailed than numerical ratings, it’s still powerful enough to learn almost as well. In fact, we show that the number of comparisons needed to find a near-optimal decision is about the same as if we had access to full numeric scores. This result brings us closer to making machine learning more user-friendly and practical in real-world settings.
Link To Code: https://github.com/ayakayal/BOHF_code_submission
Primary Area: Theory->Online Learning and Bandits
Keywords: Bayesian optimization, human, preferential, preference-based feedback, kernel methods, regret bounds, dueling bandit, RLHF
Submission Number: 7244