Abstract: An online learning problem with side information is considered. The problem is formulated as a graph-structured stochastic Multi-Armed Bandit (MAB). Each node in the graph represents an arm in the bandit problem, and an edge between two arms indicates closeness in their mean rewards. It is shown that such side information induces a unit interval graph and that several graph properties can be leveraged to achieve regret that is sublinear in the number of arms while preserving the optimal logarithmic regret in time. A lower bound on regret is established, and a hierarchical learning policy that is order-optimal in terms of both the number of arms and the learning horizon is developed.
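As an illustration of why closeness in mean rewards yields a unit interval graph, the following minimal sketch assumes that "closeness" means the two means differ by at most a threshold `eps`. The abstract does not state the threshold, and the names `means`, `eps`, `side_information_graph`, and `unit_intervals` are hypothetical. Under that assumption, assigning arm i the interval [mean_i, mean_i + eps] gives equal-length intervals that overlap exactly when the two arms share an edge.

```python
from itertools import combinations


def side_information_graph(means, eps):
    """Edges (i, j), i < j, between arms whose mean rewards differ by at most eps.

    The threshold rule is an assumption made for illustration only.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(means)), 2)
        if abs(means[i] - means[j]) <= eps
    }


def unit_intervals(means, eps):
    """Give arm i the interval [means[i], means[i] + eps]; all have length eps."""
    return [(m, m + eps) for m in means]


def intervals_overlap(a, b):
    """True when the closed intervals a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical mean rewards, unknown to the learner in the actual problem.
means = [0.10, 0.15, 0.40, 0.45, 0.90]
eps = 0.1

edges = side_information_graph(means, eps)
intervals = unit_intervals(means, eps)

# Two arms are adjacent exactly when their equal-length intervals overlap,
# which is the defining property of a unit interval graph.
overlap_edges = {
    (i, j)
    for i, j in combinations(range(len(means)), 2)
    if intervals_overlap(intervals[i], intervals[j])
}
```

In this toy instance the arms fall into groups of mutually close means. Such grouping is one kind of graph structure a policy could plausibly exploit to avoid exploring every arm individually, although the sketch does not implement any learning policy.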