Keywords: Fairness of exposure, multi-armed bandit, fairness, multiple-play multi-armed bandit
Abstract: We study a stochastic multiple-play multi-armed bandit (MAB) problem under semi-bandit feedback, in which a decision maker selects $K$ arms from a set of $M$ arms at each time step, subject to fairness constraints requiring that each arm be selected at least a predefined fraction of the time. The objective is to maximize the cumulative expected reward while satisfying the fairness constraints. Under mild conditions, we characterize an optimal policy for the fair multiple-play MAB problem and, based on this characterization, propose a class of algorithms called Fair-MMAB(K). We show that Fair-MMAB(K) satisfies the fairness constraints at every time step, regardless of the choice of UCB index, and achieves an $O(1)$ fairness-aware regret when instantiated with UCB1 or KL-UCB. Numerical experiments validate our theoretical findings and demonstrate that Fair-MMAB(K) outperforms existing fair multiple-play MAB algorithms.
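To make the fairness constraint concrete, the following is a minimal sketch of how such a constraint could be checked on a selection history. It is not the Fair-MMAB(K) algorithm, and the abstract does not state the exact form of the constraint; the sketch assumes that arm $i$ must have been selected at least $\lfloor r_i t \rfloor$ times after $t$ rounds, where $r_i$ is its required fraction. The names `is_feasible`, `fairness_satisfied`, `fractions`, and `history` are hypothetical.

```python
from math import floor


def is_feasible(fractions, k):
    """Necessary condition on the required fractions r_i.

    Each round selects exactly k distinct arms, so every r_i must lie
    in [0, 1] and the r_i can sum to at most k.
    """
    in_range = all(0.0 <= r <= 1.0 for r in fractions)
    return in_range and sum(fractions) <= k + 1e-12


def fairness_satisfied(history, fractions):
    """Check an ASSUMED per-round constraint: count_i(t) >= floor(r_i * t).

    history: list of sets, where history[t - 1] holds the arms played in round t.
    """
    counts = [0] * len(fractions)
    for t, chosen in enumerate(history, start=1):
        for i in chosen:
            counts[i] += 1
        if any(counts[i] < floor(r * t) for i, r in enumerate(fractions)):
            return False
    return True


# Toy instance: M = 4 arms, K = 2 plays per round, each arm owed 25% of rounds.
fractions = [0.25, 0.25, 0.25, 0.25]

# Alternating between arms {0, 1} and {2, 3} gives every arm half of the rounds.
round_robin = [{(2 * t) % 4, (2 * t + 1) % 4} for t in range(8)]

# Always playing arms {0, 1} starves arms 2 and 3.
always_same = [{0, 1} for _ in range(8)]

print(is_feasible(fractions, k=2))
print(fairness_satisfied(round_robin, fractions))
print(fairness_satisfied(always_same, fractions))
```

In this toy instance the alternating schedule passes the check, while the schedule that always plays the same two arms fails it at round 4, when arms 2 and 3 are each owed their first selection.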
Submission Number: 85