Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: Non-stationary bandits, constrained feedback, query budget, query-reward tradeoff
TL;DR: We propose the first prior-free algorithm that achieves near-optimal dynamic regret for non-stationary multi-armed bandits under constrained feedback.
Abstract: Non-stationary multi-armed bandits (nsMAB) enable agents to adapt to changing environments by incorporating mechanisms that detect and respond to shifts in reward distributions. However, existing approaches typically assume that reward feedback is available at every round—an assumption that overlooks many real-world scenarios where feedback is limited. In this paper, we take a significant step forward by introducing a new model of *constrained feedback in non-stationary multi-armed bandits* (ConFee-nsMAB), in which the availability of reward feedback is restricted. We propose the first prior-free algorithm—that is, one that requires no prior knowledge of the degree of non-stationarity—that achieves near-optimal dynamic regret in this setting. Specifically, our algorithm attains a dynamic regret of $\tilde{\mathcal{O}}(K^{1/3} V_T^{1/3} T / B^{1/3})$, where $T$ is the number of rounds, $K$ is the number of arms, $B$ is the query budget, and $V_T$ is the variation budget capturing the degree of non-stationarity.
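The sketch below illustrates the constrained-feedback interaction model described in the abstract, under the natural reading that the learner accrues reward every round but only observes it when it spends one of its $B$ queries. It uses a toy piecewise-stationary Bernoulli environment and a naive explore-then-exploit placeholder policy; neither is the paper's prior-free algorithm, and all names here are illustrative assumptions. As a quick sanity check on the stated bound, setting $B = T$ gives $\tilde{\mathcal{O}}(K^{1/3} V_T^{1/3} T^{2/3})$, the familiar full-feedback dynamic-regret rate for non-stationary bandits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ConFee-nsMAB interaction protocol (illustration only): T rounds, K arms,
# at most B reward queries. Reward accrues every round, but the learner only
# observes it when it spends a query.
T, K, B = 10_000, 3, 500

# Hypothetical piecewise-stationary Bernoulli means: the best arm changes once.
def arm_means(t):
    return np.array([0.3, 0.5, 0.7]) if t < T // 2 else np.array([0.7, 0.5, 0.3])

counts = np.zeros(K)          # number of observed pulls per arm
estimates = np.zeros(K)       # empirical mean of observed rewards per arm
queries_used = 0
total_reward = 0.0

for t in range(T):
    # Placeholder policy (not the paper's algorithm): explore round-robin
    # while query budget remains, then exploit the empirical estimates.
    if queries_used < B:
        arm = t % K
    else:
        arm = int(np.argmax(estimates))

    reward = rng.binomial(1, arm_means(t)[arm])   # reward accrues regardless
    total_reward += reward

    if queries_used < B:                          # constrained feedback:
        queries_used += 1                         # observe only if queried
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(f"total reward = {total_reward:.0f}, queries used = {queries_used}/{B}")
```

Because the placeholder policy stops learning once the budget is spent, it cannot track the mid-horizon shift in the toy environment; deciding *when* to spend the $B$ queries under unknown non-stationarity is precisely the difficulty the paper's prior-free algorithm addresses.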
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 3446