TL;DR: First optimal and adaptive regret bounds for non-stationary infinite-armed bandits
Abstract: We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a _reservoir distribution_. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were previously attained only with knowledge of non-stationarity parameters, and only for certain regularity regimes of the reservoir distribution. This work establishes the first parameter-free optimal regret bounds while also relaxing these distributional assumptions. We also study a natural notion of _significant shift_ for this problem, inspired by recent developments in finite-armed MAB (Suk and Kpotufe, 2022), and show that tighter regret bounds in terms of the number of significant shifts can be attained adaptively. Our enhanced rates depend only on the rotting non-stationarity, exhibiting an interesting phenomenon for this problem: rising non-stationarity does not contribute to its difficulty.
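To make the interaction model concrete, below is a minimal simulation sketch of an infinite-armed bandit whose arm means are drawn from a reservoir distribution and decay ("rot") as they are pulled. The uniform reservoir, the decay rate `rot_rate`, and the naive threshold policy are illustrative assumptions only; this is not the paper's algorithm or reward model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000          # horizon
rot_rate = 1e-4     # per-pull decay of a pulled arm's mean (illustrative rotting model)
threshold = 0.8     # abandon an arm once its empirical mean drops below this

def draw_new_arm():
    """Sample a fresh arm's initial mean reward from the reservoir distribution.
    A uniform reservoir on [0, 1] is assumed here purely for illustration."""
    return rng.uniform(0.0, 1.0)

# Naive baseline: repeatedly query the reservoir for a new arm and keep pulling
# it while its empirical mean stays above the threshold. This only illustrates
# the interaction protocol, not the paper's parameter-free adaptive algorithm.
total_reward = 0.0
t = 0
while t < T:
    mean = draw_new_arm()
    pulls, reward_sum = 0, 0.0
    while t < T:
        reward = rng.binomial(1, np.clip(mean, 0.0, 1.0))  # Bernoulli reward
        reward_sum += reward
        total_reward += reward
        pulls += 1
        t += 1
        mean -= rot_rate            # the arm's mean "rots" with each pull
        if reward_sum / pulls < threshold:
            break                   # discard the arm and draw a new one
print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```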
Lay Summary: We study the multi-armed bandit problem (used, e.g., in clinical trials or recommender systems) where one chooses from infinitely many options, or _arms_, whose rewards change over time. Past work on infinite-armed bandits mostly assumed stable reward models or needed to know in advance how rewards would change. In contrast, changing reward models are well studied for the simpler finite-armed bandit problem.
Our work introduces the first algorithm that automatically tunes itself to changing rewards without needing prior knowledge about the changes. We provide mathematical proofs that our method achieves the best possible performance guarantees (measured in a worst-case sense) in terms of the number or amount of changes in rewards. We also show how to focus only on meaningful changes in rewards, leading to faster learning when few of the changes are _significantly_ harmful. For this, we introduce a new notion of significant changes for the infinite-armed setting, inspired by prior work on the finite-armed analogue of the problem.
Link To Code: https://github.com/joesuk/NonStationaryInfiniteBandits
Primary Area: Theory->Online Learning and Bandits
Keywords: non-stationary, infinite-armed bandits, bandits, adaptive, parameter-free
Submission Number: 14788