Optimal and Greedy Algorithms for Multi-Armed Bandits with Many Arms

CoRR 2020 (modified: 05 Nov 2022)
Abstract: We study a Bayesian $k$-armed bandit problem in the many-armed regime, where $k \geq \sqrt{T}$ and $T$ is the time horizon. We first show that subsampling is critical for designing optimal policies. Specifically, the standard UCB algorithm is sub-optimal, while a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB on that subset, is rate-optimal. Despite its theoretically optimal regret, SS-UCB numerically performs worse than a greedy algorithm that always pulls the empirically best arm. These empirical insights also hold in a contextual setting, as shown by simulations on real data. These results suggest a new form of free exploration in the many-armed regime that benefits greedy algorithms. We theoretically show that this source of free exploration is deeply connected to a tail event of the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from free exploration due to variation in covariates, as discussed in the recent literature on contextual bandits. Building on this result, we prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits in the many-armed regime and achieves sublinear regret for more general reward distributions. Taken together, our results suggest that practitioners may benefit from using greedy algorithms in the many-armed regime.
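The abstract describes two policies by their high-level structure only: SS-UCB (subsample $\Theta(\sqrt{T})$ arms, then run UCB on the subset) and a subsampled greedy rule (after initial pulls, always play the arm with the best empirical mean). Below is a minimal Python sketch of both ideas on a toy Bernoulli instance; it is not the authors' code, and the subsample size, the UCB1-style bonus, the `reward_fn` interface, and all constants are illustrative assumptions.

```python
import numpy as np


def ss_ucb(reward_fn, k, T, rng=None):
    """Subsampled UCB (SS-UCB) sketch: sample ~sqrt(T) of the k arms,
    then run a UCB1-style index policy on that subset only.
    `reward_fn(arm)` is a hypothetical callable returning a reward in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    m = min(k, int(np.ceil(np.sqrt(T))))           # subsample Theta(sqrt(T)) arms
    arms = rng.choice(k, size=m, replace=False)
    counts = np.zeros(m)
    means = np.zeros(m)
    total_reward = 0.0
    for t in range(T):
        if t < m:                                   # pull each subsampled arm once
            i = t
        else:                                       # UCB1 index over the subset
            ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)
            i = int(np.argmax(ucb))
        r = reward_fn(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]     # incremental mean update
        total_reward += r
    return total_reward


def ss_greedy(reward_fn, k, T, rng=None):
    """Subsampled Greedy sketch: after one pull of each subsampled arm,
    always play the arm with the highest empirical mean (no exploration bonus)."""
    rng = np.random.default_rng() if rng is None else rng
    m = min(k, int(np.ceil(np.sqrt(T))))
    arms = rng.choice(k, size=m, replace=False)
    counts = np.zeros(m)
    means = np.zeros(m)
    total_reward = 0.0
    for t in range(T):
        i = t if t < m else int(np.argmax(means))
        r = reward_fn(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        total_reward += r
    return total_reward


if __name__ == "__main__":
    # Toy Bernoulli bandit with arm means drawn from a Uniform(0, 1) prior.
    rng = np.random.default_rng(0)
    k, T = 2000, 10_000                             # many-armed regime: k >= sqrt(T)
    mu = rng.uniform(size=k)
    pull = lambda arm: float(rng.random() < mu[arm])
    print("SS-UCB cumulative reward:   ", ss_ucb(pull, k, T, rng))
    print("SS-Greedy cumulative reward:", ss_greedy(pull, k, T, rng))
```

The only difference between the two sketches is the arm-selection rule after the initial pass: SS-UCB adds an optimism bonus, while the greedy variant relies on the abundance of near-optimal arms in the subsample for its (free) exploration.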