An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems

TMLR Paper 5872 Authors

11 Sept 2025 (modified: 07 Nov 2025)
Under review for TMLR
License: CC BY 4.0
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta>0$, where both the action $a\in \mathcal{A}$ and the unknown parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via an analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret the agent incurs and the information it gains about the parameter $\theta$. We improve upon previous results and establish that the information ratio is bounded by $d(4/\alpha)^2$, where $d$ is the dimension of the problem and $\alpha$ is a \emph{minimax measure} of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$. Notably, our bound neither scales exponentially with the logistic slope $\beta$ nor depends on the cardinality of the action and parameter spaces. Using this result, we derive a bound on the expected regret of Thompson Sampling of order $O(d \alpha^{-1} \sqrt{T \log(\beta T/d)})$, where $T$ is the number of time steps. To our knowledge, this is the first regret bound for any logistic bandit algorithm that avoids exponential scaling with $\beta$ and is independent of the number of actions. In particular, when the parameter space lies on the unit sphere and is contained in the action space, the expected regret bound is of order $O(d \sqrt{T \log(\beta T/d)})$.
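To make the setting concrete, here is a minimal, self-contained sketch of Thompson Sampling for the logistic bandit described above. For tractability it restricts to a finite grid of candidate parameters on the unit circle with an exact discrete posterior update; the grid, prior, slope value, and action set are illustrative assumptions of this sketch, not choices made in the paper, whose analysis applies to general action and parameter sets in the unit ball.

```python
# Illustrative sketch only: finite parameter grid with an exact discrete
# posterior; the paper's setting and analysis are more general.
import numpy as np

rng = np.random.default_rng(0)
d, beta, T = 2, 2.0, 1000  # dimension, logistic slope, horizon (assumed values)

# Finite action set A and parameter grid O on the unit circle (assumption;
# the paper only requires both to lie in the d-dimensional unit ball).
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
actions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
params = actions.copy()
theta_star = params[3]  # the unknown parameter the agent must learn

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

log_post = np.zeros(len(params))  # log of a uniform prior over the grid
regret = 0.0
for t in range(T):
    # Thompson Sampling, step 1: draw theta from the current posterior.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    theta = params[rng.choice(len(params), p=post)]
    # Step 2: play the action that is greedy for the sampled parameter.
    i = int(np.argmax(actions @ theta))
    # Binary reward: P(Y = 1 | a, theta*) = sigmoid(beta * <a, theta*>).
    y = rng.random() < sigmoid(beta * actions[i] @ theta_star)
    # Exact Bayesian update of the discrete posterior.
    p = sigmoid(beta * params @ actions[i])  # P(Y = 1) under each candidate
    log_post += np.log(p) if y else np.log1p(-p)
    # Instantaneous regret in expected reward, for bookkeeping.
    regret += (sigmoid(beta * actions @ theta_star).max()
               - sigmoid(beta * actions[i] @ theta_star))

print(f"cumulative regret after {T} rounds: {regret:.2f}")
```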
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the paper to address the reviewer’s comments on clarity, intuition, and discussion of the bound’s dependencies. In Section 2, we added explanations to help readers build intuition about the alignment constant $\alpha$ and understand why it appears in the analysis. In Section 4, we clarified the geometric interpretation of the dependence on $1/\alpha$ and added paragraphs discussing the dependence of the regret bound on the time horizon $T$, the problem dimension $d$, and the alignment constant $\alpha$, emphasizing the tightness of the result. In Section 5, we included additional intuitive remarks explaining how the mutual information and instantaneous regret relate to the expected reward variance, as well as a short preview of how Lemmas 14 and 15 are used in the proof and why introducing the *regret surrogate* is necessary. All newly added and previously modified text is highlighted in dark green for ease of review.
Assigned Action Editor: ~Zheng_Wen1
Submission Number: 5872