Abstract: We study a stochastic multi-armed bandit setting where arms are partitioned into known
clusters, such that the parameters of arms within a cluster differ by at most a known
threshold. While the clustering structure is known a priori, the arm parameters are unknown. We
derive an asymptotic lower bound on the regret that improves upon the classical bound
of Lai & Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely
matches this lower bound asymptotically by exploiting the clustering structure and
introducing a new index for evaluating an arm that depends on the other arms in its cluster. In
this way, arms within a cluster share information with one another. We present simulation results for our
algorithm and compare its performance against KL-UCB and other well-known algorithms
for bandits with dependent arms. We discuss the robustness of the proposed algorithm
under misspecified prior information, address some limitations of this work, and conclude
by outlining possible directions for future research.
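
To make the setting concrete, the following is a minimal sketch of the standard KL-UCB baseline index named above, together with a check of the cluster constraint. It is not Clus-UCB, whose index is not specified in this abstract. The sketch assumes Bernoulli rewards and uses the simplified exploration budget log(t); the function names (`bernoulli_kl`, `kl_ucb_index`, `respects_clusters`) are hypothetical and chosen only for illustration.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, iters=50):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t), found by bisection.

    Assumption: exploration budget is log(t) with no log-log correction term.
    """
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid  # mid is still a plausible mean, so move the lower end up
        else:
            hi = mid
    return lo


def respects_clusters(means, clusters, threshold):
    """True if the arm means inside every cluster differ by at most `threshold`."""
    return all(
        max(means[a] for a in c) - min(means[a] for a in c) <= threshold
        for c in clusters
    )
```

In this baseline, each arm's index depends only on its own empirical mean and pull count. The abstract's point is that a cluster-aware index can additionally use the samples of other arms in the same cluster, since their parameters are known to lie within the threshold of each other.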
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Added subsection 5.1 to discuss when TLP should be used, and when Clus-UCB might be preferable.
Assigned Action Editor: ~Junpei_Komiyama1
Submission Number: 5652