Keywords: Multi-Agent Multi-Armed Bandits, Free Exploration, Heterogeneous Reward
TL;DR: We study the free exploration mechanism in the multi-agent multi-armed bandits with heterogeneous reward model.
Abstract: This paper studies a cooperative multi-agent bandit scenario in which the rewards observed by agents are heterogeneous---one agent's meat can be another agent's poison. Specifically, the total reward observed by each agent is the sum of two values: an arm-specific reward, capturing the intrinsic value of the arm, and a privately-known agent-specific reward, which captures the personal preference/limitations of the agent. This heterogeneity in total reward leads to different local optimal arms for agents but creates an opportunity for *free exploration* in a cooperative setting---an agent can freely explore its local optimal arm with no regret and share this free observation with some other agents who would suffer regrets if they pull this arm since the arm is not optimal for them. We first characterize a regret lower bound that captures free exploration, i.e., arms that can be freely explored have no contribution to the regret lower bound. Then, we present a cooperative bandit algorithm that takes advantage of free exploration and achieves a near-optimal regret upper bound which tightly matches the regret lower bound up to a constant factor. Lastly, we run numerical simulations to compare our algorithm with various baselines without free exploration.
Supplementary Material: pdf
Other Supplementary Material: zip