Regional Multi-Armed Bandits With Partial Informativeness

IEEE Trans. Signal Process., 2018 (modified: 04 Nov 2022)
Abstract: We consider a variant of the classic multi-armed bandit problem in which the expected reward of each arm is a function of an unknown parameter. The arms are divided into groups, each sharing a common parameter, so selecting an arm at a given time slot also reveals information about the other arms in the same group. This regional bandit model naturally bridges the classical non-informative bandit setting, where the player learns only about the chosen arm, and the global bandit model, where sampling one arm reveals information about all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit model by combining the Upper Confidence Bound (UCB) and greedy principles. Both parameter-dependent and parameter-free regret upper bounds are derived. We also establish a matching lower bound, which proves the order optimality of UCB-g. Moreover, we propose SW-UCB-g, an extension of UCB-g to a non-stationary environment where the parameters vary over time.
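The abstract describes UCB-g as a combination of a UCB rule over groups with a greedy choice within the selected group. The following is a minimal sketch of that idea only; the linear reward model mu_{g,a}(theta) = c[g][a] * theta with known coefficients, the Bernoulli feedback, and the running-average parameter estimator are all illustrative assumptions, not the paper's actual setup or analysis.

```python
import math
import random

def ucb_g(c, theta_true, horizon, seed=0):
    """Hedged sketch of the UCB-g principle.

    c[g][a]: known (assumed) coefficient of arm a in group g, in (0, 1].
    theta_true[g]: the group's hidden parameter, in [0, 1].
    Each pull of any arm in group g yields an unbiased estimate of
    theta_true[g], so one sample informs the whole group.
    """
    rng = random.Random(seed)
    n_groups = len(c)
    pulls = [0] * n_groups        # observations per group (shared by its arms)
    theta_sum = [0.0] * n_groups  # running sum of per-pull theta estimates
    total_reward = 0.0

    for t in range(1, horizon + 1):
        def index(g):
            # UCB index of a group: greedy best mean under the current
            # parameter estimate, plus a group-level exploration bonus.
            if pulls[g] == 0:
                return float("inf")  # sample each group at least once
            theta_hat = theta_sum[g] / pulls[g]
            bonus = math.sqrt(2.0 * math.log(t) / pulls[g])
            return max(c[g]) * theta_hat + bonus

        g = max(range(n_groups), key=index)
        # Greedy step: under the linear model the best arm in the group
        # is simply the one with the largest coefficient.
        a = max(range(len(c[g])), key=lambda j: c[g][j])

        mean = c[g][a] * theta_true[g]
        r = 1.0 if rng.random() < mean else 0.0
        total_reward += r

        pulls[g] += 1
        theta_sum[g] += r / c[g][a]  # unbiased sample of theta_true[g]

    return total_reward

# Toy instance: two groups of two arms; the second group holds the best arm.
coeffs = [[0.4, 0.6], [0.5, 0.9]]
reward = ucb_g(coeffs, theta_true=[0.5, 0.8], horizon=2000)
```

Because all arms in a group share one parameter, the exploration bonus here depends on the group's total pull count rather than any single arm's, which is what distinguishes the regional setting from the non-informative one.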