Keywords: multi-armed bandits, non-stationary, contextual bandits, nonparametric, Lipschitz
TL;DR: Optimal and adaptive dynamic regret bounds for non-stationary nonparametric contextual bandits in terms of a new, tighter non-stationarity measure.
Abstract: We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time.
We first establish the minimax dynamic regret rate for this less understood setting in terms of the number of changes $L$ and the total variation $V$, both of which capture all changes in distribution over the context space, and argue that state-of-the-art procedures are suboptimal in this setting.
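For concreteness, a standard formalization of these two measures (assumed here; the abstract itself does not spell them out) writes $f_t$ for the mean reward function at round $t$ over horizon $T$:
$$
L \doteq \sum_{t=2}^{T} \mathbf{1}\{f_t \neq f_{t-1}\}, \qquad V \doteq \sum_{t=2}^{T} \sup_{x \in \mathcal{X},\, a} \left| f_t(x,a) - f_{t-1}(x,a) \right|.
$$
Both are global: a change in $f_t$ anywhere in $\mathcal{X}$ increments them, even when no observed context is affected.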
Next, we turn to the question of _adaptivity_ for this setting, i.e., achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly, we posit that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of the context space $\cal X$. We therefore propose a notion of _change_, which we term _experienced significant shifts_, that better accounts for locality, and thus counts considerably fewer changes than $L$ and $V$. Furthermore, similar to recent work on non-stationary MAB (Suk & Kpotufe, 2022), _experienced significant shifts_ count only the most _significant_ changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts.
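As a rough sketch of the analogous MAB notion from Suk & Kpotufe (2022) (the contextual definition here is presumably a localized refinement, not stated in this abstract): writing $\delta_t(a)$ for the gap of arm $a$ at round $t$ among $K$ arms, a new significant phase begins at the first time $t$ after the last shift $\tau$ such that _every_ arm $a$ incurs significant regret on some interval, i.e., there exist rounds $\tau \le s_1 < s_2 \le t$ with
$$
\sum_{s=s_1}^{s_2} \delta_s(a) \;\ge\; \sqrt{K (s_2 - s_1)},
$$
so that milder fluctuations, or changes leaving some arm near-optimal throughout, do not count as shifts.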
Our main result shows that this more tolerant notion of change can in fact be adapted to.
Supplementary Material: pdf
Submission Number: 15164