Abstract: We study the multi-armed bandit problem in which the aim is to minimize the simple regret under a fixed budget. The Sequential Halving algorithm is known to tackle this problem efficiently. We present a more elaborate version of this algorithm that integrates external knowledge, or “scores”, which can for instance be provided by a neural network or by a heuristic such as all-moves-as-first (AMAF) in the context of Monte-Carlo Tree Search. We provide both theoretical justifications and experiments.
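For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal sketch of standard Sequential Halving only, without the scores that the paper adds. The function name `sequential_halving`, the `pull` callback, and the choice to rank arms using only the samples drawn in the current round are assumptions made for illustration, not the authors' implementation.

```python
import math


def sequential_halving(pull, n_arms, budget):
    """Return the index of the arm believed best, using at most `budget` pulls.

    `pull(arm)` is a hypothetical callback returning one stochastic reward
    for the given arm index.
    """
    if n_arms == 1:
        return 0
    # Number of elimination rounds: halve the candidate set until one arm remains.
    rounds = math.ceil(math.log2(n_arms))
    survivors = list(range(n_arms))
    for _ in range(rounds):
        # Split the budget evenly across rounds, then across surviving arms.
        pulls_per_arm = budget // (len(survivors) * rounds)
        means = {}
        for arm in survivors:
            if pulls_per_arm > 0:
                total = sum(pull(arm) for _ in range(pulls_per_arm))
                means[arm] = total / pulls_per_arm
            else:
                # Budget too small to sample this round; arms are then tied.
                means[arm] = 0.0
        # Keep the better half (rounded up) according to the empirical means.
        survivors.sort(key=lambda a: means[a], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

With four arms and a budget of 40, this sketch spends 5 pulls on each arm in the first round and 10 pulls on each of the two survivors in the second. A score-aware variant would presumably change how arms are ranked or eliminated, but the abstract does not specify how, so that part is left out here.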