Regional Multi-Armed Bandits With Partial Informativeness

IEEE Trans. Signal Process., 2018 (modified: 04 Nov 2022)
Abstract: We consider a variant of the classic multi-armed bandit problem in which the expected reward of each arm is a function of an unknown parameter. The arms are divided into groups, each sharing a common parameter, so selecting an arm at a given time slot also reveals information about the other arms in the same group. This regional bandit model naturally bridges the classical non-informative bandit setting, where the player learns only about the chosen arm, and the global bandit model, where sampling one arm reveals information about all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit model by combining the Upper Confidence Bound (UCB) and greedy principles. Both parameter-dependent and parameter-free regret upper bounds are derived. We also establish a matching lower bound, which proves the order optimality of UCB-g. Moreover, we propose SW-UCB-g, an extension of UCB-g to a non-stationary environment where the parameters vary over time.
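The abstract describes UCB-g as a combination of a UCB rule over groups with a greedy choice within the selected group. The following is a minimal sketch of that idea only; the linear reward model mu_{g,a}(theta) = c[g][a] * theta with known coefficients, the Bernoulli feedback, and the running-average parameter estimator are all illustrative assumptions, not the paper's actual setup or analysis.

```python
import math
import random

def ucb_g(c, theta_true, horizon, seed=0):
    """Hedged sketch of the UCB-g principle.

    c[g][a]: known (assumed) coefficient of arm a in group g, in (0, 1].
    theta_true[g]: the group's hidden parameter, in [0, 1].
    Each pull of any arm in group g yields an unbiased estimate of
    theta_true[g], so one sample informs the whole group.
    """
    rng = random.Random(seed)
    n_groups = len(c)
    pulls = [0] * n_groups        # observations per group (shared by its arms)
    theta_sum = [0.0] * n_groups  # running sum of per-pull theta estimates
    total_reward = 0.0

    for t in range(1, horizon + 1):
        def index(g):
            # UCB index of a group: greedy best mean under the current
            # parameter estimate, plus a group-level exploration bonus.
            if pulls[g] == 0:
                return float("inf")  # sample each group at least once
            theta_hat = theta_sum[g] / pulls[g]
            bonus = math.sqrt(2.0 * math.log(t) / pulls[g])
            return max(c[g]) * theta_hat + bonus

        g = max(range(n_groups), key=index)
        # Greedy step: under the linear model the best arm in the group
        # is simply the one with the largest coefficient.
        a = max(range(len(c[g])), key=lambda j: c[g][j])

        mean = c[g][a] * theta_true[g]
        r = 1.0 if rng.random() < mean else 0.0
        total_reward += r

        pulls[g] += 1
        theta_sum[g] += r / c[g][a]  # unbiased sample of theta_true[g]

    return total_reward

# Toy instance: two groups of two arms; the second group holds the best arm.
coeffs = [[0.4, 0.6], [0.5, 0.9]]
reward = ucb_g(coeffs, theta_true=[0.5, 0.8], horizon=2000)
```

Because all arms in a group share one parameter, the exploration bonus here depends on the group's total pull count rather than any single arm's, which is what distinguishes the regional setting from the non-informative one.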