Abstract: This article studies a distributed multi-armed bandit problem with heterogeneous reward observations. The problem is solved cooperatively by $N$ agents, each of which faces a common set of $M$ arms yet observes only local, biased rewards of those arms. The goal of each agent is to minimize the cumulative expected regret with respect to the arms' true rewards, where the mean of each arm's true reward equals the average of the means of all agents' observed biased rewards. Each agent recursively updates its decision using information from its neighbors. Neighbor relationships are described by a time-dependent directed graph $\mathbb{G}(t)$ whose vertices correspond to agents and whose arcs depict the neighbor relationships. A fully distributed bandit algorithm is proposed that couples the classical distributed averaging algorithm with the celebrated upper confidence bound (UCB) bandit algorithm. It is shown that for any uniformly strongly connected sequence of graphs $\mathbb{G}(t)$, the algorithm guarantees a regret of order $O(\log T)$ for each agent.
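To make the coupling between consensus averaging and UCB concrete, the following is a minimal, hypothetical Python sketch of the kind of scheme the abstract describes. The fixed ring-graph weight matrix `W`, the Gaussian noise model, and the exploration constant are illustrative assumptions and do not reproduce the paper's exact update rules or its time-varying graph sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 5000                      # agents, arms, horizon

# Local biased means: by assumption, the true mean of arm k is the
# average of the agents' biased means for that arm (column average).
local_means = rng.uniform(0.0, 1.0, size=(N, M))
true_means = local_means.mean(axis=0)

# Assumed fixed communication graph: ring with self-loops,
# row-stochastic mixing matrix W (the paper allows time-varying graphs).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

est = np.zeros((N, M))                    # consensus estimates of true means
counts = np.ones((N, M))                  # pull counts (one virtual init pull)
regret = np.zeros(N)

for t in range(1, T + 1):
    # UCB step: each agent pulls the arm maximizing its confidence index.
    ucb = est + np.sqrt(2.0 * np.log(t) / counts)
    arms = ucb.argmax(axis=1)

    for i, k in enumerate(arms):
        # Each agent observes only a *locally biased* reward of its arm.
        r = local_means[i, k] + 0.1 * rng.standard_normal()
        counts[i, k] += 1
        est[i, k] += (r - est[i, k]) / counts[i, k]   # local running mean
        # Regret is measured against the true (averaged) arm means.
        regret[i] += true_means.max() - true_means[k]

    # Averaging step: mix estimates with neighbors so each agent's
    # estimate tracks the network-wide average, i.e., the true means.
    est = W @ est

print("per-agent cumulative regret:", np.round(regret, 1))
```

Under these assumptions the per-agent cumulative regret grows slowly with the horizon, consistent with the $O(\log T)$ guarantee stated above, although this sketch makes no claim to match the paper's constants or proof conditions.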