Keywords: contextual dueling bandits, relative (dueling) feedback, preference feedback, offline regression oracle, regret minimization, continuous decision spaces
TL;DR: We design the first efficient, near-optimal regret algorithm for contextual dueling bandits using offline oracles, enabling scalable preference-based learning in RLHF and resolving a key open problem in AI alignment.
Abstract: The problem of contextual dueling bandits is central to reinforcement learning with human feedback (RLHF), a widely used approach in AI alignment for incorporating human preferences into learning systems. Despite its importance, existing methods are constrained either by strong preference modeling assumptions or by applicability only to finite action spaces. Moreover, prior algorithms typically rely on online optimization oracles, which are computationally infeasible for complex function classes, limiting their practical effectiveness. In this work, we present the first fundamental theoretical study of general contextual dueling bandits over continuous action spaces. Our key contribution is a novel algorithm based on a regularized min-max optimization framework that achieves a regret bound of $\tilde{O}(\sqrt{dT})$—the first such guarantee for this general setting. By leveraging offline oracles instead of online ones, our method further improves computational efficiency. Empirical evaluations validate our theoretical findings, with our approach significantly outperforming existing baselines in terms of regret.
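For intuition, here is a minimal worked sketch of the kind of regularized min-max objective over an offline-estimated preference function that the abstract alludes to. The notation is assumed for illustration only (the preference function $f$, its offline estimate $\hat{f}_t$, the regularization weight $\gamma$, and the reference distribution $\mu$ are not taken from the submission), and the formulation is generic rather than the paper's actual algorithm.

% Illustrative sketch only: a generic contextual dueling-bandit protocol and a
% regularized min-max selection rule over an offline preference estimate.
% All symbols below are assumed notation, not the submission's own definitions.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
In context $x_t$, let $f(x_t,a,b)\in[-1,1]$ denote the expected advantage of
action $a$ over $b$, observed only through a noisy binary comparison of the
chosen pair $(a_t,b_t)$. Dueling regret against the per-context best action is
\[
  \mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \max_{a^\ast \in \mathcal{A}}
  \frac{f(x_t, a^\ast, a_t) + f(x_t, a^\ast, b_t)}{2} .
\]
A regularized min--max selection rule built on an offline (least-squares)
estimate $\hat{f}_t$ of the preference function chooses
\[
  p_t \in \arg\min_{p \in \Delta(\mathcal{A})}
  \Bigl\{ \max_{q \in \Delta(\mathcal{A})}
    \mathbb{E}_{a \sim p,\, b \sim q}\bigl[\hat{f}_t(x_t, b, a)\bigr]
    \;+\; \tfrac{1}{\gamma}\,\mathrm{KL}(p \,\Vert\, \mu) \Bigr\},
\]
where $\gamma>0$ balances worst-case disadvantage against the regularizer and
$\mu$ is a reference distribution over the (possibly continuous) action set
$\mathcal{A}$.
\end{document}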
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 26496