An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems

TMLR Paper 5872 Authors

11 Sept 2025 (modified: 16 Sept 2025) · Under review for TMLR · CC BY 4.0
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta>0$, and both the action $a\in \mathcal{A}$ and the unknown parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via an analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it gains about the parameter $\theta$. We improve upon previous results and establish that the information ratio is bounded by $d(4/\alpha)^2$, where $d$ is the dimension of the problem and $\alpha$ is a \emph{minimax measure} of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order $O(d \alpha^{-1} \sqrt{T \log(\beta T/d)})$, where $T$ is the number of time steps. To our knowledge, this is the \emph{first regret bound for any logistic bandit algorithm} that avoids exponential scaling with $\beta$ and is independent of the number of actions. In particular, when the parameters lie on the unit sphere and the action space contains the parameter space, the expected regret bound is of order $O(d \sqrt{T \log(\beta T/d)})$.
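
The reward model and the Thompson Sampling loop described in the abstract can be illustrated with a short simulation. The sketch below is not from the paper: it assumes a finite action set, reuses the action vectors as a discrete parameter grid so that the posterior can be maintained exactly, and the function names (`logistic_reward_prob`, `thompson_sampling_logistic`) are hypothetical. The paper analyzes Thompson Sampling in general; this is only a minimal working example of the setting.

```python
import numpy as np

def logistic_reward_prob(a, theta, beta):
    # P(reward = 1 | a, theta) = exp(beta <a, theta>) / (1 + exp(beta <a, theta>)).
    return 1.0 / (1.0 + np.exp(-beta * (a @ theta)))

def thompson_sampling_logistic(actions, theta_star, beta, T, rng=None):
    # actions: (n, d) array of unit-norm actions; theta_star: true parameter in the unit ball.
    # Illustrative simplification: the candidate parameters are the action vectors themselves,
    # so the posterior is an exact discrete distribution over that grid.
    rng = np.random.default_rng() if rng is None else rng
    params = actions.copy()
    log_post = np.zeros(len(params))       # uniform prior over the grid, in log scale
    rewards = []
    for _ in range(T):
        # 1. Sample a parameter from the current posterior.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        theta_hat = params[rng.choice(len(params), p=post)]
        # 2. Play the greedy action for the sampled parameter.
        a = actions[np.argmax(actions @ theta_hat)]
        # 3. Observe a Bernoulli reward from the true logistic model.
        r = rng.random() < logistic_reward_prob(a, theta_star, beta)
        rewards.append(r)
        # 4. Exact Bayesian update: add the Bernoulli log-likelihood of (a, r) for every candidate.
        p = logistic_reward_prob(params, a, beta)   # (n,) predicted reward probabilities
        log_post += np.log(p) if r else np.log1p(-p)
    return np.array(rewards)

# Example usage: d = 5, 50 random unit-norm actions, true parameter drawn from the same set.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
theta_star = A[rng.integers(len(A))]
rewards = thompson_sampling_logistic(A, theta_star, beta=3.0, T=1000, rng=rng)
print("average reward:", rewards.mean())
```
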
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Corrected minor typos and replaced equality (k) by an inequality in Proof 13. Fixed the rate of GLM-TSL (Kveton et al., 2020) in Table 1.
Assigned Action Editor: ~Zheng_Wen1
Submission Number: 5872