Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 spotlight · CC BY 4.0
Keywords: Combinatorial bandits, Thompson Sampling
TL;DR: We propose a modified version of Thompson sampling for combinatorial bandits, the first that does not exhibit exponential regret.
Abstract: We consider Thompson Sampling (TS) for linear combinatorial semi-bandits with subgaussian rewards. We propose the first known TS algorithm whose finite-time regret does not scale exponentially with the dimension of the problem. We further exhibit the mismatched sampling paradox: a learner who knows the reward distributions and samples from the correct posterior can perform exponentially worse than a learner who does not know the rewards and simply samples from a well-chosen Gaussian posterior. The code used to generate the experiments is available at https://github.com/RaymZhang/CTS-Mismatched-Paradox
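To make the "Gaussian posterior" idea in the abstract concrete, the sketch below shows a generic Gaussian Thompson Sampling loop for an m-set combinatorial semi-bandit (pick m of d base arms each round, observe each chosen arm's reward). It is an illustrative sketch under standard assumptions, not the paper's exact algorithm; the function name, the per-arm initialization, and the noise_std parameter are all placeholders.

```python
import numpy as np

def gaussian_ts_semi_bandit(true_means, m, horizon, noise_std=1.0, seed=None):
    """Illustrative Gaussian TS for an m-set combinatorial semi-bandit.
    Sketch only; not the algorithm proposed in the paper."""
    rng = np.random.default_rng(seed)
    d = len(true_means)
    counts = np.ones(d)                                   # one forced initial pull per arm (assumption)
    sums = true_means + noise_std * rng.standard_normal(d)
    best = np.sort(true_means)[-m:].sum()                 # value of the optimal super-arm
    regret = 0.0
    for _ in range(horizon):
        mean_hat = sums / counts
        # Sample each arm's mean from an independent Gaussian "posterior".
        theta = rng.normal(mean_hat, noise_std / np.sqrt(counts))
        # Super-arm maximizing the sampled total: here just the top-m arms.
        action = np.argpartition(theta, -m)[-m:]
        rewards = true_means[action] + noise_std * rng.standard_normal(m)
        sums[action] += rewards                            # semi-bandit feedback: per-arm updates
        counts[action] += 1
        regret += best - true_means[action].sum()
    return regret

# Example usage (hypothetical instance):
# print(gaussian_ts_semi_bandit(np.array([0.1, 0.2, 0.5, 0.6, 0.9]), m=2, horizon=5000, seed=0))
```

The top-m action set is used here only because its combinatorial optimization step is trivial; the paper's setting covers general linear combinatorial semi-bandits.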
Primary Area: Bandits
Submission Number: 7001