Keywords: confidence ellipsoid, sample complexity, pure exploration, probably approximately correct, non-convexity
TL;DR: We propose an algorithm for estimating the best mean reward in a multi-armed bandit with asymptotically optimal, instance-adaptive sample complexity.
Abstract: We study the problem of estimating the mean reward of the best arm in a multi-armed bandit (MAB) setting. Specifically, given a target precision $\varepsilon$ and a confidence level $1-\delta$, the goal is to return an $\varepsilon$-accurate estimate of the largest mean reward with probability at least $1-\delta$, while minimizing the number of samples. We first establish an instance-dependent lower bound on the sample complexity, whose derivation must handle the infinitely many possible candidates for the estimated best mean. This lower bound is expressed as a non-convex optimization problem, which constitutes the main difficulty of the problem and prevents the direct application of standard techniques such as Track-and-Stop to provably achieve optimality. To overcome this difficulty, we introduce several new algorithmic and analytical techniques and propose an algorithm that achieves the asymptotic lower bound with matching constants in the leading term. Our method combines a confidence-ellipsoid-based stopping condition with a two-phase sampling strategy tailored to handle the non-convexity. The proposed algorithm is simple, nearly free of hyperparameters, and achieves the instance-dependent, asymptotically optimal sample complexity. Experimental results support our theoretical guarantees and demonstrate the practical effectiveness of our method.
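To make the problem statement concrete, the following is a minimal sketch of a naive, non-adaptive baseline for the same $(\varepsilon, \delta)$ goal, not the proposed confidence-ellipsoid/two-phase algorithm. It assumes rewards bounded in $[0,1]$ and uses Hoeffding's inequality with a union bound; the names `naive_best_mean_estimate` and the `pull(i)` interface are hypothetical.

```python
import numpy as np

def naive_best_mean_estimate(pull, n_arms, eps, delta):
    """Naive uniform-sampling baseline (NOT the paper's algorithm).

    Pulls every arm equally often so that, by Hoeffding's inequality and a
    union bound, each empirical mean is within eps of its true mean with
    probability at least 1 - delta; the maximum of the empirical means is
    then an eps-accurate estimate of the largest mean reward.

    Assumes rewards are bounded in [0, 1]; pull(i) returns one sample
    from arm i (hypothetical interface).
    """
    # Hoeffding + union bound: n >= ln(2K/delta) / (2 eps^2) samples per arm suffice.
    n_per_arm = int(np.ceil(np.log(2 * n_arms / delta) / (2 * eps ** 2)))
    means = np.array([np.mean([pull(i) for _ in range(n_per_arm)])
                      for i in range(n_arms)])
    return means.max(), n_arms * n_per_arm  # estimate, total samples used

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_means = [0.3, 0.5, 0.7]                     # hypothetical instance
    pull = lambda i: rng.binomial(1, true_means[i])  # Bernoulli rewards in {0, 1}
    est, n_used = naive_best_mean_estimate(pull, len(true_means), eps=0.05, delta=0.05)
    print(f"estimated best mean: {est:.3f} using {n_used} samples")
```

This baseline's sample complexity is uniform over instances; the instance-dependent, asymptotically optimal complexity claimed in the abstract is achieved only by the adaptive algorithm described in the paper.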
Supplementary Material:  zip
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 8427