Keywords: Zero-shot coordination, offline learning, online fine-tuning
Abstract: A central challenge in multi-agent reinforcement learning is zero-shot coordination (ZSC): the ability of agents to collaborate with previously unseen partners. Existing approaches, such as population-based training or convention-avoidance methods, improve ZSC but typically rely on extensive online interaction, leading to high sample complexity. A natural alternative is to leverage pre-existing interaction datasets through offline learning. However, offline training alone is insufficient for effective ZSC, as agents tend to overfit to the conventions present in the dataset and struggle to adapt to novel partners. To address this limitation, we propose an \emph{offline-to-online ZSC} framework that combines offline dataset diversity with efficient online adaptation. In the offline stage, trajectories are embedded and clustered into behavioral modes to train specialized agents and their belief models, from which a best-response agent is learned. In the online stage, this agent is refined through belief-guided counterfactual rollouts, where belief models simulate alternative successor states under different teammate behaviors, thereby expanding the training distribution beyond the dataset. Experiments on the ZSC benchmark \textit{Hanabi} in the two-player setting, as well as in human-AI coordination, demonstrate that our approach achieves state-of-the-art performance with unseen partners while requiring substantially less online interaction.
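The sketch below illustrates the two-stage pipeline the abstract describes: offline trajectories are embedded and clustered into behavioral modes, a toy belief model is fit per mode, and counterfactual successor states are then simulated under each hypothesized teammate mode to augment online training. It is a minimal illustration under assumed toy data; all names (`embed_trajectory`, `BeliefModel`, `counterfactual_successors`) are hypothetical and not the authors' implementation.

```python
# Minimal sketch of the offline-to-online ZSC pipeline described in the abstract.
# All names and the toy data are hypothetical illustrations, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def embed_trajectory(traj: np.ndarray) -> np.ndarray:
    """Embed a (T, obs_dim) trajectory as its mean observation (toy embedding)."""
    return traj.mean(axis=0)

# --- Offline stage: embed and cluster the dataset into behavioral modes ---
dataset = [rng.normal(loc=m, size=(20, 8)) for m in (0.0, 1.0, 2.0) for _ in range(30)]
embeddings = np.stack([embed_trajectory(t) for t in dataset])
modes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

class BeliefModel:
    """Toy belief model: predicts successor states under one behavioral mode."""
    def __init__(self, trajs):
        self.mean = np.mean([t.mean(axis=0) for t in trajs], axis=0)
        self.std = np.mean([t.std(axis=0) for t in trajs], axis=0) + 1e-3

    def sample_successor(self, state: np.ndarray) -> np.ndarray:
        # Perturb the current state toward this mode's statistics.
        return 0.5 * state + 0.5 * self.mean + rng.normal(scale=self.std)

belief_models = [
    BeliefModel([t for t, m in zip(dataset, modes) if m == k]) for k in range(3)
]

# --- Online stage: belief-guided counterfactual rollouts ---
def counterfactual_successors(state: np.ndarray) -> list:
    """Simulate alternative successor states under each hypothesized teammate mode."""
    return [bm.sample_successor(state) for bm in belief_models]

state = rng.normal(size=8)
augmented_batch = counterfactual_successors(state)  # expands the training distribution
print(len(augmented_batch), "counterfactual successors generated")
```

In this reading, the per-mode belief models stand in for the specialized agents' partner models, and the counterfactual successors provide the extra, out-of-dataset experience used to refine the best-response agent online.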
Primary Area: reinforcement learning
Submission Number: 20954