Keywords: UCB, Inconsistent Preferences, Near-optimal Regret
TL;DR: We identified a weak assumption, under which the UCB algorithm can provably achieve near-optimal regret while learning from inconsistent preferences.
Abstract: In this paper, we study how to learn from inconsistent human feedback in the setting of combinatorial bandits with semi-bandit feedback -- where an online learner in every time step chooses a size-$k$ set of arms, observes a stochastic reward for each arm, and endeavors to maximize the sum of the per-arm rewards in the set. We consider the challenging setting where these per-arm rewards are not only set-dependent, but also {\em inconsistent:} the expected reward of arm $a$ can be larger than that of arm $b$ in one set, but smaller in another. Inconsistency is often observed in practice, falls outside the purview of many popular semi-bandit models, and in general can make finding the optimal set combinatorially hard.
Motivated by the observed practice of using UCB-based algorithms even in settings where they are not strictly justified, our main contribution is to present a simple assumption: {\em weak optimal set consistency}. We show that this assumption allows for inconsistent set-dependent arm rewards, and also subsumes many widely used models for semi-bandit feedback. Most importantly, we show that it ensures that a simple UCB-based algorithm finds the optimal set and achieves $O\left(\min(\frac{k^3 n \log T}{\epsilon}, k^2\sqrt{n T \log T})\right)$ regret, which nearly matches the lower bound.
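To make the algorithmic idea concrete, the following is a minimal sketch of a combinatorial UCB routine in the semi-bandit setting: each round, play the $k$ arms with the highest UCB indices and update each played arm from its observed reward. This is a hypothetical simplification that treats arm rewards as set-independent Bernoulli draws; the paper's setting allows set-dependent, inconsistent rewards, and the function name and confidence-bonus constant are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def cucb_top_k(true_means, k, horizon, seed=0):
    """Sketch of combinatorial UCB with semi-bandit feedback.

    Simplifying assumption: per-arm rewards are set-independent
    Bernoulli(true_means[i]) draws, unlike the paper's set-dependent,
    possibly inconsistent rewards. Returns the empirically best size-k set.
    """
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts = np.zeros(n)  # number of times each arm was played
    sums = np.zeros(n)    # cumulative observed reward per arm

    for t in range(horizon):
        # UCB index: empirical mean plus an exploration bonus;
        # unplayed arms get an infinite index so they are tried first.
        means = np.divide(sums, counts, out=np.zeros(n), where=counts > 0)
        bonus = np.sqrt(1.5 * np.log(t + 2) / np.maximum(counts, 1))
        ucb = np.where(counts > 0, means + bonus, np.inf)

        chosen = np.argsort(ucb)[-k:]  # size-k set with the top indices
        # Semi-bandit feedback: observe a reward for every chosen arm.
        rewards = rng.random(k) < np.asarray(true_means)[chosen]
        counts[chosen] += 1
        sums[chosen] += rewards

    return set(np.argsort(sums / np.maximum(counts, 1))[-k:].tolist())
```

With well-separated means and a modest horizon, the routine concentrates its plays on the best $k$ arms, mirroring how the UCB indices drive the learner toward the optimal set.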
Submission Number: 15