Keywords: Multi-armed bandit, Thompson Sampling, Posterior Sampling
TL;DR: This work revisits Vanilla Thompson Sampling for stochastic bandits
Abstract: In this work, we derive a new problem-dependent regret bound for Thompson Sampling with Gaussian priors (Algorithm 2 in [Agrawal and Goyal, 2017]), one of the classical stochastic bandit algorithms that has demonstrated excellent empirical performance and has been widely deployed in real-world applications. The existing regret bound is $\sum_{i \in [K]: \Delta_i >0}\left(\frac{288 \left(e^{64}+6 \right) \ln \left(T\Delta_i^2 + e^{32} \right)}{\Delta_i} + \frac{10.5}{\Delta_i} + \Delta_i\right)$, where $[K]$ denotes the arm set, $\Delta_i$ denotes the single-round performance loss incurred by pulling a sub-optimal arm $i$ instead of the optimal arm, and $T$ is the time horizon. Since real-world learning tasks care about an algorithm's performance at finite $T$, the existing regret bound is non-vacuous only when $T > 288 \cdot e^{64}$, which may not be practical. Our new regret bound is $\sum_{i \in [K]: \Delta_i >0} \left(\frac{1252 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)}{\Delta_i} +\frac{18 \ln \left(T\Delta_i^2 \right)}{\Delta_i} + \frac{182.5}{\Delta_i}+ \Delta_i\right)$, which significantly tightens the coefficient of the leading term. Despite these improvements, we emphasize that the goal of this work is to deepen the theoretical understanding of Thompson Sampling and thereby unlock the full potential of this classical learning algorithm for challenging real-world learning problems.
Submission Number: 85
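For readers unfamiliar with the algorithm analyzed in the abstract, the following is a minimal sketch of Thompson Sampling with Gaussian priors for a $K$-armed stochastic bandit, in the spirit of Algorithm 2 in [Agrawal and Goyal, 2017]: each arm's sample is drawn from a Gaussian centered at its empirical mean with variance shrinking as $1/(k_i+1)$, where $k_i$ is the arm's pull count. The function name, the `reward_fns` interface, and the Bernoulli arms in the usage example are illustrative assumptions, not artifacts of the paper.

```python
import numpy as np


def thompson_sampling_gaussian(reward_fns, T, rng=None):
    """Sketch of Thompson Sampling with Gaussian priors (assumed interface).

    reward_fns: list of K callables, each returning a bounded reward when called.
    T:          time horizon (number of rounds).
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(reward_fns)
    pulls = np.zeros(K)   # k_i: number of times arm i has been pulled
    means = np.zeros(K)   # \hat{mu}_i: empirical mean reward of arm i
    rewards = []

    for _ in range(T):
        # Sample theta_i ~ N(\hat{mu}_i, 1 / (k_i + 1)) for every arm,
        # then play the arm with the largest sampled value.
        theta = rng.normal(means, 1.0 / np.sqrt(pulls + 1.0))
        arm = int(np.argmax(theta))

        r = reward_fns[arm]()
        # Incremental update of the empirical mean for the pulled arm.
        means[arm] = (means[arm] * pulls[arm] + r) / (pulls[arm] + 1.0)
        pulls[arm] += 1.0
        rewards.append(r)

    return np.array(rewards)


# Usage example with hypothetical Bernoulli arms (arm means chosen for illustration).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arm_means = [0.5, 0.6, 0.7]
    arms = [lambda p=p: float(rng.random() < p) for p in arm_means]
    rewards = thompson_sampling_gaussian(arms, T=10_000, rng=rng)
    print("average reward:", rewards.mean())
```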