Abstract:

Classic no-regret online prediction algorithms, including variants of $\texttt{Hedge}$ in the full-information setting and variants of $\texttt{UCB}$ in the bandit-feedback setting, are unfair by design: they aim to play the most rewarding arm as many times as possible while ignoring the less rewarding arms. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of reward accrual for each arm, and we study it in both the full-information and bandit-feedback settings. Combining queueing-theoretic techniques with adversarial online learning, we propose a new online prediction policy, called $\texttt{BanditQ}$, that meets the target rate constraints while attaining an instance-independent regret of $O(T^{\frac{3}{4}})$. This bound improves to $O(\sqrt{T})$ when the regret is averaged over the entire horizon. The proposed $\texttt{BanditQ}$ policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem. The design and analysis of $\texttt{BanditQ}$ involve a novel use of the potential-function method, in conjunction with recent scale-free second-order MAB regret bounds and a certain self-bounding inequality for the reward gradients, and are of independent interest.

TL;DR:

We design a fair online MAB policy that meets given feasible target reward rates for each arm.
