Keywords: Online Learning, Multi-armed Bandits
TL;DR: We provide a comprehensive theoretical analysis of bandits with slowly-varying reward means: instance-dependent regret upper bounds, a minimax regret upper bound, and a minimax regret lower bound.
Abstract: We consider minimisation of dynamic regret in non-stationary multi-armed bandits with a slowly varying property. Namely, we assume that arms' rewards are stochastic and independent over time, but that the absolute difference between the expected rewards of any arm at any two consecutive time-steps is at most a drift limit $\delta > 0$. For this setting, which has received little attention in the past, we give a new algorithm that naturally extends the well-known Successive Elimination algorithm to the non-stationary bandit setting. We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits. The analysis, in turn, relies on a novel characterization of the instance as a {\em detectable gap} profile that depends on the expected arm reward differences. We also provide the first minimax regret lower bound for this problem, enabling us to show that our algorithm is essentially minimax optimal. Moreover, this lower bound establishes that the seemingly easier slowly-varying bandits problem is at least as hard as the more general total variation-budgeted bandits problem in the minimax sense. We complement our theoretical results with experimental illustrations.
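As a hypothetical illustration (not the paper's algorithm or code), the following Python sketch simulates the slowly-varying reward model described in the abstract: each arm's expected reward changes by at most a drift limit $\delta$ between consecutive time-steps, while realized rewards are drawn independently over time. All parameter names and values here are assumptions for demonstration only.

```python
import numpy as np

def simulate_slowly_varying_bandit(K=5, T=1000, delta=0.01, seed=0):
    """Simulate K arms whose expected rewards drift by at most `delta`
    per time-step, with Bernoulli rewards drawn independently over time."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(0.0, 1.0, size=K)        # initial expected rewards
    mean_history, rewards = [], []
    for t in range(T):
        mean_history.append(means.copy())
        rewards.append(rng.binomial(1, means))   # stochastic rewards at step t
        # Each arm's mean moves by at most delta, staying within [0, 1].
        drift = rng.uniform(-delta, delta, size=K)
        means = np.clip(means + drift, 0.0, 1.0)
    return np.array(mean_history), np.array(rewards)

means, rewards = simulate_slowly_varying_bandit()
# Consecutive expected rewards never differ by more than delta for any arm.
assert np.all(np.abs(np.diff(means, axis=0)) <= 0.01 + 1e-12)
```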
Submission Number: 303