Recurrent Submodular Welfare and Matroid Blocking Semi-BanditsDownload PDF

Published: 09 Nov 2021, Last Modified: 05 May 2023NeurIPS 2021 PosterReaders: Everyone
Keywords: non-stationary, multi-armed, bandits, blocking, regret, submodular, welfare, maximization, approximation
Abstract: A recent line of research focuses on the study of stochastic multi-armed bandits (MAB), in the case where temporal correlations of specific structure are imposed between the player's actions and the reward distributions of the arms. These correlations lead to (sub-)optimal solutions that exhibit interesting dynamical patterns -- a phenomenon that yields new challenges both from an algorithmic as well as a learning perspective. In this work, we extend the above direction to a combinatorial semi-bandit setting and study a variant of stochastic MAB, where arms are subject to matroid constraints and each arm becomes unavailable (blocked) for a fixed number of rounds after each play. A natural common generalization of the state-of-the-art for blocking bandits, and that for matroid bandits, only guarantees a $\frac{1}{2}$-approximation for general matroids. In this paper we develop the novel technique of correlated (interleaved) scheduling, which allows us to obtain a polynomial-time $(1 - \frac{1}{e})$-approximation algorithm (asymptotically and in expectation) for any matroid. Along the way, we discover an interesting connection to a variant of Submodular Welfare Maximization, for which we provide (asymptotically) matching upper and lower approximability bounds. In the case where the mean arm rewards are unknown, our technique naturally decouples the scheduling from the learning problem, and thus allows to control the $(1-\frac{1}{e})$-approximate regret of a UCB-based adaptation of our online algorithm.
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: pdf
11 Replies