Thompson Sampling for Constrained Bandits

Published: 09 May 2025 · Last Modified: 28 May 2025 · RLC 2025 · CC BY 4.0
Keywords: Bandits with Knapsacks, Thompson Sampling, Conservative Bandits
TL;DR: We propose Thompson Sampling algorithms for constrained contextual bandits, providing theoretical guarantees and empirical comparisons with UCB-based methods.
Abstract: Contextual bandits model sequential decision-making in which an agent balances exploration and exploitation to maximize long-term cumulative reward. Many real-world applications, such as online advertising and inventory pricing, impose additional resource constraints, while in high-stakes settings like healthcare and finance, early-stage exploration can pose significant risks. The Contextual Bandits with Knapsacks (CBwK) framework extends contextual bandits to incorporate resource constraints, while the Contextual Conservative Bandit (CCB) framework ensures that the learner's performance remains above $(1-\alpha)$ times that of a predefined safe baseline. Although Upper Confidence Bound (UCB) based methods exist for both setups, a Thompson Sampling (TS) based approach has not been explored. This gap in the literature motivates the study of TS for constrained settings, further reinforced by the fact that TS often demonstrates superior empirical performance in the unconstrained setting. In this work, we consider the linear CBwK and CCB setups and design Thompson Sampling algorithms LinCBwK-TS and LinCCB-TS, respectively. We provide an $\tilde{O}\big((\frac{\text{OPT}}{B}+1)m\sqrt{T}\big)$ regret bound for LinCBwK-TS, where $\text{OPT}$ is the optimal value and $B$ is the total budget. Further, we show that LinCCB-TS has regret bounded by $\tilde{O}\big(\sqrt{T}\min\{m^{3/2},m\sqrt{\log K}\} + \frac{m^3\Delta_h}{\alpha r_l (\Delta_l + \alpha r_l)}\big)$ and maintains the performance guarantee with high probability, where $\Delta_h$ and $\Delta_l$ are upper and lower bounds on the baseline gap and $r_l$ is a lower bound on the baseline reward.
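
For readers unfamiliar with the TS principle the abstract refers to, the following is a minimal, generic sketch of linear Thompson Sampling for an *unconstrained* contextual bandit. It is not the paper's LinCBwK-TS or LinCCB-TS algorithm (those additionally track resource consumption or a safe-baseline constraint); all variable names, dimensions, and noise levels below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, T = 5, 10, 1000            # feature dim, number of arms, horizon (assumed values)
theta_true = rng.normal(size=m) / np.sqrt(m)   # unknown reward parameter

A = np.eye(m)                    # posterior precision (ridge regularizer = identity)
b = np.zeros(m)                  # running sum of reward-weighted features
v = 1.0                          # posterior inflation parameter

for t in range(T):
    contexts = rng.normal(size=(K, m)) / np.sqrt(m)   # one feature vector per arm
    # Sample a parameter from the Gaussian posterior N(A^{-1} b, v^2 A^{-1})
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    theta_sample = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    # Act greedily with respect to the sampled parameter
    arm = int(np.argmax(contexts @ theta_sample))
    x = contexts[arm]
    reward = x @ theta_true + 0.1 * rng.normal()
    # Bayesian linear-regression posterior update
    A += np.outer(x, x)
    b += reward * x
```

The constrained algorithms studied in the paper modify this basic loop, e.g. by adjusting the arm-selection step to respect a knapsack budget $B$ or a conservative baseline constraint, rather than acting purely greedily on the sampled parameter.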
Submission Number: 365