Combinatorial Dueling Bandits

17 Sept 2025 (modified: 18 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Combinatorial Bandits, Dueling bandits
TL;DR: This paper introduces the first systematic study of Contextual Combinatorial Dueling Bandits, a novel framework for online decision-making with preference feedback. We propose two algorithms with theoretical guarantees for linear and nonlinear cases.
Abstract: We introduce the \emph{Contextual Combinatorial Dueling Bandits (CDB)} problem, a novel framework for modeling complex online decision-making under relative and binary feedback. In each round, the learner observes contextual information for a set of arms and selects two subsets of $k$ arms, termed \emph{super arms}. The feedback consists of pairwise binary preferences between the arms in the two chosen super arms. For example, in recommendation systems, a user might be shown two competing sets of items and provide preference feedback for each pair of items. We propose two algorithms to address this problem: \emph{LinCDB} for linear score functions and \emph{NCDB} for nonlinear cases. Both algorithms leverage the Hungarian algorithm for efficient selection of the second super arm. We theoretically demonstrate that LinCDB achieves a regret bound of $\widetilde{O}\left( \frac{d}{\kappa_\mu} \sqrt{Tk} \right)$, while NCDB achieves $\widetilde{O}\left( \left(\frac{1}{\kappa_\mu} \sqrt{\widetilde{d}} + B \sqrt{\frac{\lambda}{\kappa_\nu}} \right) \sqrt{Tk\widetilde{d} } \right)$. Here, $d$ represents the dimension of the context for each arm, $k$ is the size of the super arm, and $\widetilde{d}$ denotes the effective dimension. To our knowledge, this is the first work to study combinatorial bandits with preference feedback.
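The abstract states that both LinCDB and NCDB use the Hungarian algorithm to select the second super arm, i.e., they solve a maximum-weight assignment between the $k$ arms of the first super arm and $k$ candidate arms. A minimal sketch of that assignment step, assuming a hypothetical $k \times k$ matrix of pairwise scores (the paper's actual objective and score definitions are not reproduced here); for illustration the matching is found by brute force over permutations, which computes the same optimum the Hungarian algorithm finds in $O(k^3)$:

```python
from itertools import permutations

def best_assignment(score):
    """Maximum-weight perfect matching on a k x k score matrix.

    Brute-force over all k! permutations for illustration; the Hungarian
    algorithm computes the same optimum in O(k^3). Returns the best total
    score and a tuple `match` where match[i] is the column (candidate arm
    for the second super arm) paired with row i (arm of the first super arm).
    """
    k = len(score)
    best_total, best_match = float("-inf"), None
    for perm in permutations(range(k)):
        total = sum(score[i][perm[i]] for i in range(k))
        if total > best_total:
            best_total, best_match = total, perm
    return best_total, best_match

# Hypothetical pairwise scores (illustrative numbers, not from the paper):
# rows = arms in the first super arm, columns = candidate arms.
score = [
    [3.0, 1.0, 2.0],
    [1.0, 4.0, 1.5],
    [2.5, 0.5, 3.5],
]
total, match = best_assignment(score)  # total = 10.5, match = (0, 1, 2)
```

In practice one would replace the brute-force search with a proper Hungarian solver (e.g., `scipy.optimize.linear_sum_assignment`), since $k!$ grows quickly; the score matrix here stands in for whatever uncertainty- or preference-based objective the paper's algorithms optimize.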
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 9445