Thompson Sampling For Bandits With Cool-Down Periods

TMLR Paper4818 Authors

10 May 2025 (modified: 17 May 2025), Under review for TMLR, CC BY 4.0
Abstract: This paper investigates a variation of dynamic bandits, characterized by arms that follow a periodic availability pattern. Upon a "successful" selection, each arm transitions to an inactive state and requires a possibly unknown cool-down period before becoming active again. We devise Thompson Sampling algorithms specifically designed for this problem, guaranteeing logarithmic regret. Notably, this work is the first to address scenarios in which the agent lacks knowledge of each arm's active state. Furthermore, the theoretical findings extend to the sleeping bandit framework, yielding a regret bound that improves markedly on existing results.
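To make the setting concrete, the following is a minimal illustrative sketch (not the paper's algorithm) of Thompson Sampling over Bernoulli arms where a successful pull deactivates the arm for a known cool-down period and availability is observed; the names `true_means`, `cooldown`, and `horizon` are hypothetical parameters chosen for this example, and the paper also covers the harder cases of unknown cool-downs and unobserved active states.

```python
import numpy as np

def ts_with_cooldowns(true_means, cooldown, horizon, rng=None):
    """Sketch: Beta-Bernoulli Thompson Sampling restricted to active arms,
    with a known per-arm cool-down triggered by a successful pull."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_means)
    alpha = np.ones(k)          # Beta posterior: successes + 1
    beta = np.ones(k)           # Beta posterior: failures + 1
    available_at = np.zeros(k)  # first round at which each arm is active again
    total_reward = 0.0

    for t in range(horizon):
        active = np.flatnonzero(available_at <= t)
        if active.size == 0:    # all arms are cooling down: skip the round
            continue
        # Sample from the posterior of each active arm and pull the best sample
        samples = rng.beta(alpha[active], beta[active])
        arm = active[np.argmax(samples)]
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
        if reward:              # a "successful" pull triggers the cool-down
            available_at[arm] = t + 1 + cooldown[arm]
        total_reward += reward
    return total_reward

# Example run: three arms with different means and cool-down lengths
print(ts_with_cooldowns([0.7, 0.5, 0.3], cooldown=[5, 3, 0], horizon=1000))
```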
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alberto_Maria_Metelli2
Submission Number: 4818