Best-of-Three-Worlds Analysis for Dueling Bandits with Borda Winner

ICLR 2026 Conference Submission18375 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: dueling bandits; Borda winner; best of three worlds; FTRL
Abstract: The dueling bandits (DB) problem addresses online learning from relative preferences, where the learner queries pairs of arms and receives binary win-loss feedback. Most existing work focuses on designing algorithms for a specific stochastic or adversarial environment. Recently, a unified algorithm was proposed that adapts to all of these settings. However, this approach relies on the existence of a Condorcet winner, which often fails to hold, particularly when the preference matrix changes in the adversarial setting. For the more general Borda winner objective, no unified framework currently achieves optimal regret across these environments simultaneously. In this paper, we explore how the follow-the-regularized-leader (FTRL) algorithm can be employed to achieve this goal. We propose a hybrid negative-entropy regularizer and show that it yields $\tilde{O}(K^{1/3} T^{2/3})$ regret in the adversarial setting, $O(K \log^2 T / \Delta_{\min}^2)$ regret in the stochastic setting, and $O(K \log^2 T / \Delta_{\min}^2 + (C^2 K \log^2 T / \Delta_{\min}^2)^{1/3})$ regret in the corrupted setting, where $K$ is the number of arms, $T$ is the horizon, $\Delta_{\min}$ is the minimum gap between the optimal and sub-optimal arms, and $C$ is the corruption level. These results match the state of the art in each individual setting while eliminating the need to assume a specific environment type. We also present experimental results demonstrating the advantages of our algorithm over baseline methods across different environments.
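To give a concrete feel for the FTRL machinery the abstract refers to, the sketch below shows a generic FTRL update over the probability simplex with a plain negative-entropy regularizer, for which the minimizer has the familiar exponential-weights closed form. This is only a minimal illustration under assumed names and tuning (`ftrl_neg_entropy`, the step size `eta`, and the placeholder losses are all hypothetical); it is not the paper's algorithm, whose hybrid regularizer and Borda-score loss estimators are not specified in the abstract.

```python
# Minimal FTRL sketch over the probability simplex (illustrative only).
# NOT the submission's algorithm: the hybrid negative-entropy regularizer and
# the dueling-bandit (Borda-score) loss estimators are not given in the
# abstract, so a plain negative-entropy regularizer is used here, for which
# the FTRL step reduces to exponential weights.

import numpy as np


def ftrl_neg_entropy(cum_losses: np.ndarray, eta: float) -> np.ndarray:
    """FTRL step: argmin_p <cum_losses, p> + (1/eta) * sum_i p_i log p_i
    over the simplex, whose closed form is p_i proportional to
    exp(-eta * cum_losses_i)."""
    logits = -eta * (cum_losses - cum_losses.min())  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


# Toy run with K arms and T rounds; losses are random placeholders.
rng = np.random.default_rng(0)
K, T = 5, 1000
eta = np.sqrt(np.log(K) / T)  # standard tuning for losses in [0, 1]
cum = np.zeros(K)
for t in range(T):
    p = ftrl_neg_entropy(cum, eta)
    arm = rng.choice(K, p=p)   # a dueling-bandit learner would sample a PAIR of arms
    losses = rng.random(K)     # placeholder full-information losses, purely for illustration
    cum += losses
print("final distribution:", np.round(ftrl_neg_entropy(cum, eta), 3))
```

In the paper's setting, the uniform-entropy term would be replaced by the proposed hybrid regularizer, and the full-information loss vector by an estimator built from the observed win-loss feedback of the queried pair.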
Primary Area: learning theory
Submission Number: 18375