Generalized Objectives in Adaptive Experiments: The Frontier between Regret and Speed

Published: 27 Oct 2023, Last Modified: 22 Dec 2023, RealML-2023
Keywords: multi-armed bandits, regret minimization, best-arm identification, Thompson sampling
Abstract: This paper formulates a generalized model of multi-armed bandit experiments that accommodates both cumulative regret minimization and best-arm identification objectives. We identify the optimal instance-dependent scaling of the cumulative cost across experimentation and deployment, which is expressed in the familiar form uncovered by Lai and Robbins (1985). We show that the nature of asymptotically efficient algorithms is nearly independent of the cost functions, emphasizing a remarkable universality phenomenon. Balancing various cost considerations is reduced to an appropriate choice of exploitation rate. Additionally, we explore the Pareto frontier between the length of the experiment and the cumulative regret across experimentation and deployment. A notable and universal feature is that even a slight reduction in the exploitation rate (from one to a slightly lower value) results in a substantial decrease in the experiment's length, accompanied by only a minimal increase in the cumulative regret.
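The abstract's "exploitation rate" can be illustrated with a standard algorithm from this literature: top-two Thompson sampling (Russo, 2016), where a parameter `beta` controls how often the posterior leader is played versus a challenger arm. This is a minimal sketch on Bernoulli arms, not the paper's own algorithm; the function names and the specific `beta` mechanism here are illustrative assumptions.

```python
import random

def top_two_thompson_step(successes, failures, beta=0.9, rng=random):
    """Select one arm via top-two Thompson sampling on Bernoulli arms.

    `beta` acts as an exploitation rate (an illustrative stand-in for the
    paper's notion): with probability `beta` the posterior leader is played,
    otherwise a challenger arm is explored.
    """
    k = len(successes)
    # Draw one mean per arm from its Beta(successes+1, failures+1) posterior.
    samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
    leader = max(range(k), key=lambda i: samples[i])
    if rng.random() < beta:
        return leader
    # Otherwise, resample until a different arm leads: the "challenger".
    for _ in range(100):  # cap retries so the sketch always terminates
        resamples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        challenger = max(range(k), key=lambda i: resamples[i])
        if challenger != leader:
            return challenger
    return leader

def run_bandit(true_means, horizon, beta=0.9, seed=0):
    """Run the sampler for `horizon` steps and return cumulative regret."""
    rng = random.Random(seed)
    k = len(true_means)
    successes, failures = [0] * k, [0] * k
    regret = 0.0
    best = max(true_means)
    for _ in range(horizon):
        arm = top_two_thompson_step(successes, failures, beta, rng)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret
```

Lowering `beta` slightly below one shifts more pulls toward challenger arms, which shortens the time needed to identify the best arm at a small cost in cumulative regret, mirroring the Pareto-frontier trade-off the abstract describes.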
Submission Number: 57