Keywords: decentralized learning, multi-armed bandits, multiplicative weights, distributed gossip model
Abstract: We study a distributed multi-armed bandit setting among
a population of $n$ memory-constrained nodes in the gossip model:
at each round, every node locally adopts one of $m$ arms,
observes a reward drawn from the arm’s (adversarially chosen)
distribution, and then communicates with a randomly sampled neighbor,
exchanging information to determine its policy in the next round.
We introduce and analyze several families of dynamics for this
task that are *decentralized*:
each node's decision is entirely local and depends
only on its most recently obtained reward and that of
the neighbor it sampled.
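To make the setting concrete, here is a minimal simulation sketch of one such memoryless dynamic, assuming a complete communication graph and Bernoulli rewards; the particular "adopt the neighbor's arm if its reward was higher" rule is our own illustrative choice, not necessarily one of the families analyzed in the paper.

```python
import random

def gossip_round(arms, means, rng=random):
    """One synchronous round; arms[v] is node v's current arm.

    Hypothetical instantiation of a decentralized gossip dynamic:
    each node uses only its own most recent reward and that of one
    randomly sampled neighbor, matching the memory constraint above.
    """
    n = len(arms)
    # Each node pulls its arm; Bernoulli means may be chosen
    # adversarially per round (held fixed here for simplicity).
    rewards = [1 if rng.random() < means[arms[v]] else 0 for v in range(n)]
    new_arms = arms[:]
    for v in range(n):
        u = rng.randrange(n)  # uniform neighbor (complete graph; self-samples harmless)
        # Local, memoryless comparison of the two most recent rewards.
        if rewards[u] > rewards[v]:
            new_arms[v] = arms[u]
    return new_arms

# Example: n = 1000 nodes, m = 3 arms with Bernoulli means 0.2, 0.5, 0.8.
arms = [random.randrange(3) for _ in range(1000)]
for _ in range(20):
    arms = gossip_round(arms, means=[0.2, 0.5, 0.8])
print([arms.count(i) for i in range(3)])  # population counts per arm
```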
We show a connection between the global evolution of these
decentralized dynamics and a certain class of
*“zero-sum” multiplicative weight update* algorithms,
and we develop a general framework for analyzing the population-level
regret of these natural protocols.
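For reference, in our own notation rather than the paper's: if $p_t(i)$ denotes the fraction of the population playing arm $i$ and $\mu_t(i)$ the mean reward of arm $i$ at round $t$, a classical multiplicative weight update with step size $\eta$ takes the form

$$p_{t+1}(i) \;=\; \frac{p_t(i)\,\big(1 + \eta\,\mu_t(i)\big)}{\sum_{j=1}^{m} p_t(j)\,\big(1 + \eta\,\mu_t(j)\big)}.$$

The “zero-sum” class referenced above should be read as a variant of updates of this general shape arising from the pairwise gossip comparisons, not as this exact formula.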
Using this framework, we derive sublinear regret bounds under a
wide range of parameter regimes (i.e., the relative sizes of $m$ and $n$)
in an adversarial reward setting (where the mean
of each arm's distribution can vary over time),
provided the number of rounds $T$ is at most logarithmic in $n$.
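One natural formalization of population-level regret in this setting (again in our notation, as an illustration of what a sublinear bound controls) is

$$\mathrm{Regret}(T) \;=\; \max_{i \in [m]} \sum_{t=1}^{T} \mu_t(i) \;-\; \frac{1}{n} \sum_{t=1}^{T} \sum_{v=1}^{n} \mathbb{E}\big[r_t(v)\big],$$

where $r_t(v)$ is the reward node $v$ observes at round $t$; a sublinear bound means this quantity is $o(T)$.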
Further, we show that these protocols can approximately
optimize convex functions over the simplex when the
reward distributions are generated from a stochastic gradient oracle.
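One standard way to read this reduction (our sketch, assuming rewards are set to negative coordinates of a stochastic gradient of a convex $f$ evaluated at the current population distribution $p_t$) is that the multiplicative update above becomes exponentiated gradient, i.e., entropic mirror descent over the simplex:

$$p_{t+1}(i) \;\propto\; p_t(i)\,\exp\!\big(-\eta\,\widehat{\nabla}_i f(p_t)\big),$$

where $\widehat{\nabla} f(p_t)$ is the stochastic gradient returned by the oracle; under this assumption the population distribution approximately tracks a mirror-descent trajectory toward a minimizer of $f$ over the simplex.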
Submission Number: 94