Abstract: Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario, in which reward distributions remain constant for certain periods and change at unknown time steps. Although Thompson Sampling (TS) has shown empirical success in non-stationary settings, there is currently no regret-bound analysis for TS with Gaussian priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound on the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret of both algorithms is of order $\tilde{O}(\sqrt{TB_T})$, where $T$ is the time horizon and $B_T$ is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
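The abstract does not give the algorithmic details, but the core idea of the discounted variant, down-weighting old observations so the posterior can track a reward distribution that changes at a breakpoint, can be sketched as follows. This is a minimal illustrative sketch only, not the paper's method: the function name `discounted_ts`, the discount factor `gamma`, the prior variance, and the Gaussian-posterior update rule are all assumptions made for the example.

```python
# Minimal sketch of a discounted Thompson Sampling loop with Gaussian
# posteriors (illustrative assumptions; the paper's exact algorithm may differ).
import numpy as np

def discounted_ts(rewards, gamma=0.95, prior_var=1.0, seed=0):
    """rewards: (T, K) array of per-arm rewards; returns the chosen arm indices."""
    rng = np.random.default_rng(seed)
    T, K = rewards.shape
    disc_sum = np.zeros(K)    # discounted sum of observed rewards per arm
    disc_count = np.zeros(K)  # discounted number of pulls per arm
    choices = []
    for t in range(T):
        # Posterior mean/variance under a Gaussian model; unplayed arms
        # fall back to the prior (mean 0, variance prior_var).
        mean = np.where(disc_count > 0,
                        disc_sum / np.maximum(disc_count, 1e-12), 0.0)
        var = prior_var / np.maximum(disc_count, 1.0)
        samples = rng.normal(mean, np.sqrt(var))  # one posterior sample per arm
        arm = int(np.argmax(samples))
        choices.append(arm)
        # Discount all past statistics before adding the new observation,
        # so rewards observed before a breakpoint are gradually forgotten.
        disc_sum *= gamma
        disc_count *= gamma
        disc_sum[arm] += rewards[t, arm]
        disc_count[arm] += 1.0
    return choices
```

A sliding-window variant would follow the same template but compute the per-arm statistics from only the last $\tau$ rounds instead of applying a geometric discount.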
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Stefan_Magureanu1
Submission Number: 3202