A Bayesian Approach to Learning Command Hierarchies for Zero-Shot Multi-Agent Coordination

Published: 13 Mar 2024, Last Modified: 22 Apr 2024 · ALA 2024 · CC BY 4.0
Keywords: Multi-Agent Reinforcement Learning, Multi-Armed Bandits, Learned Communication Protocol, Zero-Shot Learning
TL;DR: Our decentralized method learns a command hierarchy between agents to improve coordination and performance in zero- and few-shot scenarios where agents of varying skill levels have never seen each other before.
Abstract: An ongoing challenge in multi-agent reinforcement learning (MARL) is to develop algorithms that can leverage very limited experience to coordinate with new teammates. Recent work has focused on generating diverse training partners and generalizing to those partners. In this paper, we propose a modular solution called Multi-Armed Two-way Command Heuristic (MATCH) that can be added onto existing agents to learn a command hierarchy within a single episode, so that a group of agents may approach the competency of the best agent in the group. We view learning to communicate as a set of non-stationary multi-armed bandit (MAB) problems in which each agent has dedicated incoming- and outgoing-command MAB samplers that adjust its policy. When giving commands, each agent's goal is to choose the teammate who is most likely to follow its command. When receiving commands, each agent uses its own estimate of the expected benefit of following a particular commander to decide whom to follow. We show that competent agents quickly adapt to incompetent teammates by commanding and ignoring them, whereas incompetent agents learn to follow commands from more skilled teammates. If pre-trained agents are capable of sending or receiving commands before our communication structure is added, each agent's desired actions are used as a prior distribution that influences the MAB samplers and mitigates exploration regret.
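To make the two-sampler structure concrete, the sketch below shows what one such non-stationary command sampler might look like in Python. This is a hypothetical illustration, not the paper's implementation: the abstract does not specify MATCH's update or sampling rules, so the constant-step-size value tracking, softmax sampling, and `prior` seeding here are assumptions chosen to match the description above.

```python
import math
import random


class CommandBandit:
    """One agent's command sampler, sketched as a non-stationary MAB.

    Hypothetical sketch: the exact rules MATCH uses are not given in the
    abstract, so the updates below are illustrative assumptions.
    """

    def __init__(self, teammate_ids, prior=None, step_size=0.1, temperature=1.0):
        # A constant step size discounts old rewards, which suits the
        # non-stationary setting (teammates change behavior within an episode).
        self.step_size = step_size
        self.temperature = temperature
        # Seed estimates from a pre-trained agent's prior over teammates,
        # when available, to mitigate exploration regret.
        self.values = {t: (prior or {}).get(t, 0.0) for t in teammate_ids}

    def sample(self):
        """Softmax-sample a teammate to command (or a commander to follow)."""
        weights = {t: math.exp(v / self.temperature) for t, v in self.values.items()}
        r = random.uniform(0.0, sum(weights.values()))
        for t, w in weights.items():
            r -= w
            if r <= 0.0:
                return t
        return t  # floating-point fallback

    def update(self, teammate, reward):
        """Exponential-recency-weighted update toward the observed reward."""
        self.values[teammate] += self.step_size * (reward - self.values[teammate])


# Each agent would keep two samplers: an outgoing one (reward: did the target
# comply with the command?) and an incoming one (reward: the agent's own
# estimate of the benefit of following that commander).
outgoing = CommandBandit(["agent_1", "agent_2", "agent_3"])
target = outgoing.sample()
outgoing.update(target, reward=1.0)  # e.g., 1.0 if the command was followed
```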
Type Of Paper: Full paper (max 8 pages)
Anonymous Submission: Anonymized submission.
Submission Number: 17