Abstract: In this paper, we study the problem of feature selection in distributed stochastic multi-armed bandits, in which $M$ agents work collaboratively, under the coordination of a central server, to choose optimal actions and minimize the total regret. We consider a learning setting with a set of feature maps, each best suited to a certain state of the system, where the best feature map is unknown to the agents at the time of learning. In our model, an adversary chooses a distribution over the set of possible feature maps; the agents observe only this distribution, not the true feature map. Our goal is to develop a distributed algorithm that selects a sequence of actions maximizing the cumulative reward. By performing a feature vector transformation, we propose an elimination algorithm and prove that, for linearly parametrized reward functions, it achieves regret and communication bounds of $O\left( {d\sqrt {MT} \log T} \right)$ and $O\left( {(Md + d\log \log d)\log T} \right)$, respectively, where $T$ is the horizon and $d$ is the dimension of the feature vector. We validate the performance of our approach through numerical simulations.
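The abstract does not spell out the feature vector transformation. One natural reading, assumed here and not confirmed by the source, is to replace each action's feature vector by its expectation under the observed distribution over feature maps. With a linear reward, the expected reward is then linear in the transformed feature. The minimal sketch below checks this identity on a hypothetical toy instance; the names `expected_features`, `feature_maps`, `probs`, and `theta` are illustrative only.

```python
def expected_features(feature_maps, probs):
    """Average each action's feature vector over a distribution on feature maps.

    feature_maps[k][a] is the d-dimensional feature of action a under map k.
    Assumption: this expectation stands in for the paper's transformation.
    """
    n_actions = len(feature_maps[0])
    d = len(feature_maps[0][0])
    out = [[0.0] * d for _ in range(n_actions)]
    for phi, p in zip(feature_maps, probs):
        for a in range(n_actions):
            for j in range(d):
                out[a][j] += p * phi[a][j]
    return out


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical toy instance: 2 feature maps, 3 actions, d = 2.
feature_maps = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0], [0.5, 0.5]],
]
probs = [0.25, 0.75]   # distribution over feature maps (observed by the agents)
theta = [1.0, -1.0]    # reward parameter (unknown in practice; fixed here for the check)

psi = expected_features(feature_maps, probs)

# Expected reward per action, computed via the transformed features ...
via_transform = [dot(psi[a], theta) for a in range(3)]
# ... and directly as a mixture over feature maps.
via_mixture = [
    sum(p * dot(phi[a], theta) for phi, p in zip(feature_maps, probs))
    for a in range(3)
]

best_action = max(range(3), key=lambda a: via_transform[a])
```

Under this assumption, the two computations agree for every action, so an algorithm for a single known feature map can be run on the transformed features `psi`.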