From 6235149080811616882909238708 to 29: Vanilla Thompson Sampling Revisited

Bingshan Hu; Tianyue H. Zhang

Published: 26 Oct 2023, Last Modified: 13 Dec 2023, NeurIPS 2023 Workshop Poster

Keywords: Multi-armed bandit, Thompson Sampling, Posterior Sampling

TL;DR: This work revisits Vanilla Thompson Sampling for stochastic bandits

Abstract: In this work, we derive a new problem-dependent regret bound for Thompson Sampling with Gaussian priors (Algorithm 2 in [Agrawal and Goyal, 2017]), one of the classical stochastic bandit algorithms, which has demonstrated excellent empirical performance and has been widely deployed in real-world applications. The existing regret bound is $\sum_{i \in [K]: \Delta_i >0}\frac{288 \left(e^{64}+6 \right) \ln \left(T\Delta_i^2 + e^{32} \right)}{\Delta_i} + \frac{10.5}{\Delta_i} + \Delta_i$, where $[K]$ denotes the arm set, $\Delta_i$ denotes the single-round performance loss when pulling a sub-optimal arm $i$ instead of the optimal arm, and $T$ is the time horizon. Real-world learning tasks care about an algorithm's performance when $T$ is finite, yet the existing regret bound is non-vacuous only when $T > 288 \cdot e^{64}$, which may not be practical. Our new regret bound is $\sum_{i \in [K]: \Delta_i >0} \frac{1252 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)}{\Delta_i} +\frac{18 \ln \left(T\Delta_i^2 \right)}{\Delta_i} + \frac{182.5}{\Delta_i}+ \Delta_i$, which tightens the leading term's coefficient significantly. Although this improves on the existing bound, we would like to emphasize that the goal of this work is to deepen the theoretical understanding of Thompson Sampling, in order to unlock the full potential of this classical learning algorithm for solving challenging real-world learning problems.
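To get a feel for the gap between the two bounds, the sketch below evaluates both expressions exactly as they are written in the abstract and compares them with the trivial worst-case regret $T \cdot \max_i \Delta_i$. The function names (`existing_bound`, `new_bound`, `trivial_bound`) and the example gaps and horizon are illustrative assumptions and do not come from the paper.

```python
import math


def existing_bound(gaps, T):
    """Existing regret bound, transcribed from the abstract."""
    total = 0.0
    for d in gaps:
        if d <= 0:  # the sum runs only over sub-optimal arms (gap > 0)
            continue
        leading = 288 * (math.exp(64) + 6) * math.log(T * d**2 + math.exp(32)) / d
        total += leading + 10.5 / d + d
    return total


def new_bound(gaps, T):
    """New regret bound, transcribed from the abstract."""
    total = 0.0
    for d in gaps:
        if d <= 0:
            continue
        leading = 1252 * math.log(T * d**2 + 100 ** (1 / 3)) / d
        second = 18 * math.log(T * d**2) / d  # negative when T * d**2 < 1
        total += leading + second + 182.5 / d + d
    return total


def trivial_bound(gaps, T):
    """Worst case: every round pays the largest gap (illustrative baseline)."""
    return T * max(gaps)


if __name__ == "__main__":
    # Hypothetical instance: one optimal arm (gap 0) and two sub-optimal arms.
    gaps = [0.0, 0.1, 0.5]
    T = 10**6
    print("existing bound:", existing_bound(gaps, T))
    print("new bound:     ", new_bound(gaps, T))
    print("trivial bound: ", trivial_bound(gaps, T))
```

A bound tells us something only when it falls below the trivial one, so printing all three side by side shows, for a chosen instance, whether either expression is informative at that horizon.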

Submission Number: 85
