Algorithm design and sharper bounds for improving bandits

Published: 22 Sept 2025, Last Modified: 01 Dec 2025, NeurIPS 2025 Workshop, CC BY 4.0
Keywords: Multi-Armed Bandits, Improving Bandits, Data-Driven Algorithm Design, Competitive Ratio
TL;DR: We give sharper bounds on competitive ratios for improving bandits and bound the sample complexity of learning our algorithm's parameters from data.
Abstract: The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and selecting hyperparameters from learning curves. Each pull of an arm yields a reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for this problem, albeit with somewhat pessimistic worst-case guarantees: strong lower bounds show that deterministic and randomized algorithms must incur multiplicative approximation factors of $\Omega(k)$ and $\Omega(\sqrt{k})$, respectively, relative to the optimal arm, where $k$ is the number of bandit arms. In this work, we propose a new parameterized family of bandit algorithms that includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family achieves stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of concavity. We further bound the sample complexity of learning a near-optimal algorithm from the family using offline data. By taking a statistical learning perspective on the bandit reward optimization problem, we obtain stronger data-dependent guarantees without needing to verify whether the assumptions are satisfied.
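To make the setting concrete, below is a minimal Python sketch of the improving-bandits model together with a toy threshold-based pulling rule. The exponential curve shape, the threshold parameter, and the threshold_explore routine are illustrative assumptions introduced here, not the parameterized algorithm family analyzed in the paper.

import numpy as np

# Illustrative sketch of the improving-bandits setting (hypothetical, not the paper's method):
# each arm i has a monotone, concave reward curve f_i(t) giving the reward of its t-th pull,
# and the learner competes with the best single arm under a fixed budget of T pulls.

rng = np.random.default_rng(0)

def make_concave_curve(cap, rate):
    # f(t) = cap * (1 - exp(-rate * t)): increasing in t with diminishing returns.
    return lambda t: cap * (1.0 - np.exp(-rate * t))

def threshold_explore(curves, T, threshold):
    # Toy tunable rule: visit arms in random order and keep pulling an arm while its
    # marginal gain exceeds `threshold`; the threshold plays the role of a learnable parameter.
    total = 0.0
    pulls = [0] * len(curves)
    budget = T
    for i in rng.permutation(len(curves)):
        while budget > 0:
            gain = curves[i](pulls[i] + 1) - curves[i](pulls[i])  # marginal reward of next pull
            if gain < threshold:
                break
            total += gain
            pulls[i] += 1
            budget -= 1
    return total

curves = [make_concave_curve(cap=rng.uniform(0.5, 2.0), rate=rng.uniform(0.05, 0.5))
          for _ in range(5)]
print(threshold_explore(curves, T=50, threshold=0.01))

Sweeping the threshold over a collection of offline problem instances and keeping the value with the best average reward is one way to picture the data-driven parameter selection the abstract refers to, though the paper's family and guarantees are more general.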
Submission Number: 72