Keywords: contextual bandits, logistic bandits, simple regret, Thompson sampling
Abstract: We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees are known for the linear case, no such results exist for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret $\tilde{\mathcal{O}}(d/\sqrt{T})$. Notably, the leading term of our regret bound is free of $\kappa=\mathcal{O}(\exp(S))$, where $S$ is a bound on the magnitude of the unknown parameter vector, while the algorithm remains computationally tractable for finite action sets. We also introduce a new variant of Thompson Sampling adapted to the simple regret setting, which yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits. Extending these tools to the logistic case, we obtain a Thompson Sampling variant with regret $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$, again free of $\kappa$ in the leading term. As expected, the randomized algorithms are cheaper to run than their deterministic counterparts. Finally, we conduct a series of experiments to empirically validate these theoretical guarantees.
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 13299