Open Multi-agent Multi-armed Bandit with Applications in Permissionless Blockchain
Abstract: We study a multi-agent multi-armed bandit (MA-MAB) problem in open systems, where agents may enter and leave at any time and face multiple bandit problems, with the goal of minimizing the group-wise cumulative regret. To our knowledge, this is the first work to consider a dynamic set of agents that arrive and depart according to stochastic processes, so that the population evolves systematically over time. We further extend the model to a permissionless blockchain-based MA-MAB (PB-MA-MAB) problem, in which agents behave either honestly or maliciously depending on their compliance with the mechanism, and malicious agents may disrupt honest ones. These formulations pose new challenges, as regret grows with the number of agents. To address them, we design new UCB-based methodologies for both MA-MAB and PB-MA-MAB, introducing information-integration rules for existing agents and information-access mechanisms for newly arriving agents so that all available information is fully leveraged. We derive regret upper bounds for our algorithms and characterize the complexity of the formulations via regret lower bounds in both settings. Specifically, we establish regret upper bounds of order $\max\{O(M_0),\, O(\log T),\, O(\tfrac{\log^2 T}{M_0})\mathbf{1}_{\{\lambda > 0\}}\}$, a significant improvement over the naïve bound $O((M_0 + T)\log T)$, where $M_0$ is the initial number of agents and $\lambda$ reflects the arrival/departure rate. We also prove lower bounds of $\Omega(\log T)$ and $\Omega(M_0)$ for all consistent algorithms, and tighter lower bounds of $\Omega(\log T + M_0)$ or $\Omega(\log^2 T)$ for a subclass of algorithms that includes ours. Together, these results imply that our algorithm is nearly optimal in general and optimal in certain cases.
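The abstract's core mechanism, agents pooling bandit statistics so that newcomers can access accumulated information while existing agents integrate each other's observations, can be illustrated with a minimal sketch. This is a hypothetical illustration built around standard UCB1 with a single shared state object, not the paper's actual algorithm; all names (`SharedUCB`, the arm means, the arrival/departure schedule) are assumptions for the example.

```python
import math
import random

class SharedUCB:
    """Hypothetical sketch: agents pool arm counts and reward sums.
    Newcomers read the pooled statistics on arrival (an
    'information-access' step), and every pull updates the shared
    state (an 'information-integration' step)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # total pulls per arm, across all agents
        self.sums = [0.0] * n_arms   # total reward per arm, across all agents
        self.t = 0                   # global number of pulls so far

    def select_arm(self):
        # Pull each arm once before using confidence bounds.
        for a, c in enumerate(self.counts):
            if c == 0:
                return a
        # Standard UCB1 index computed on the pooled statistics.
        return max(
            range(len(self.counts)),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
        self.t += 1

# Usage: the agent pool changes over rounds, but all agents share one
# state object, so a newly arrived agent immediately benefits from
# exploration done before it entered.
random.seed(0)
means = [0.2, 0.5, 0.8]            # hypothetical Bernoulli arm means
state = SharedUCB(len(means))
agents = {0, 1}                    # initial agent pool (M_0 = 2)
for round_ in range(500):
    if round_ == 100:
        agents.add(2)              # a new agent enters mid-run
    if round_ == 300:
        agents.discard(0)          # an agent departs
    for _ in agents:
        arm = state.select_arm()
        reward = 1.0 if random.random() < means[arm] else 0.0
        state.update(arm, reward)

# With pooled statistics, the best arm typically accumulates the
# large majority of the 1200 total pulls despite the churn.
```

The single shared object stands in for whatever communication medium the agents use; in the permissionless setting the abstract describes, that medium would be the blockchain itself, with honest agents following the update rule and malicious ones potentially deviating.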
Submission Number: 1930