An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

Published: 10 Oct 2024, Last Modified: 06 Dec 2024
Venue: NeurIPS BDU Workshop 2024 Poster
License: CC BY 4.0
Keywords: multi-armed bandit, logistic bandit, Thompson Sampling, information theory, regret bounds, online optimization
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems, where the agent receives binary rewards with probabilities determined by a logistic function $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$. We focus on the setting where the action $a$ and parameter $\theta$ lie within the $d$-dimensional unit ball, with the action space containing the parameter space. Adopting the information-theoretic framework introduced by Russo and Van Roy (2015), we analyze the information ratio, defined as the ratio of the expected squared difference between the optimal and actual rewards to the mutual information between the optimal action and the reward. Improving upon previous results, we establish that the information ratio is bounded by $\tfrac{9}{2}d$. Notably, we obtain a regret bound of order $O(d\sqrt{T \log(\beta T/d)})$ that depends only logarithmically on the parameter $\beta$.
Submission Number: 34
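To make the setting in the abstract concrete, the following is a minimal sketch of Thompson Sampling on a logistic bandit. It is an illustrative assumption, not the paper's construction: the paper works with actions and parameters in the continuous $d$-dimensional unit ball, whereas this sketch restricts both to a small finite grid so that the Bayesian posterior update is exact. Rewards are Bernoulli with success probability $\sigma(\beta \langle a, \theta \rangle)$, matching the abstract's reward model.

```python
import math
import random

def sigmoid(x):
    # numerically safe logistic function sigma(x) = 1 / (1 + exp(-x))
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def thompson_sampling_logistic(actions, thetas, true_theta,
                               beta=2.0, T=2000, seed=0):
    """Thompson Sampling for a logistic bandit on finite grids.

    Rewards are Bernoulli(sigmoid(beta * <a, theta>)); the prior over
    the candidate parameters `thetas` is uniform. The finite grids and
    exact discrete posterior are simplifications for illustration.
    """
    rng = random.Random(seed)
    log_post = [0.0] * len(thetas)   # log posterior weights (uniform prior)
    counts = [0] * len(actions)      # how often each arm was played
    for _ in range(T):
        # 1. sample a parameter from the current posterior
        m = max(log_post)
        w = [math.exp(lp - m) for lp in log_post]
        r, acc, idx = rng.random() * sum(w), 0.0, len(w) - 1
        for k, wk in enumerate(w):
            acc += wk
            if acc >= r:
                idx = k
                break
        th = thetas[idx]
        # 2. act greedily for the sampled parameter
        #    (sigmoid is increasing, so argmax of <a, theta> suffices)
        a_i = max(range(len(actions)), key=lambda i: dot(actions[i], th))
        counts[a_i] += 1
        # 3. observe a Bernoulli reward generated by the true parameter
        reward = rng.random() < sigmoid(beta * dot(actions[a_i], true_theta))
        # 4. exact Bayesian update of every candidate parameter's weight
        for k, thk in enumerate(thetas):
            p = sigmoid(beta * dot(actions[a_i], thk))
            log_post[k] += math.log(p) if reward else math.log(1.0 - p)
    return counts, log_post
```

For example, with $d = 2$ one can place eight unit-circle arms and reuse them as the parameter grid; after a few thousand rounds the posterior concentrates on the true parameter and the optimal arm dominates the play counts:

```python
acts = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
        for i in range(8)]
counts, log_post = thompson_sampling_logistic(acts, acts, true_theta=acts[3])
```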