Keywords: Multi-Agent Reinforcement Learning, Reinforcement Learning, Cooperative Multi-Agent Reinforcement Learning, Deep Reinforcement Learning
TL;DR: By training in an off-team manner, we mitigate the training-time and test-time covariate shift of off-belief learning, yielding near-optimal zero-shot coordination and reducing covariate shift in ad-hoc teamplay and proxy human-AI coordination.
Abstract: Zero-shot coordination (ZSC) evaluates an algorithm by the performance of a team of agents that were trained independently under that algorithm. Off-belief learning (OBL) is a recent method that achieves state-of-the-art results in ZSC in the game Hanabi. However, the implementation of OBL relies on a belief model that experiences covariate shift. Moreover, during ad-hoc coordination, OBL or any other neural policy may experience test-time covariate shift. We present two methods addressing these issues. The first method, off-team belief learning (OTBL), attempts to improve the accuracy of the belief model of a target policy π_T on a broader range of inputs by weighting trajectories approximately according to the distribution induced by a different policy π_b. The second, off-team off-belief learning (OT-OBL), attempts to compute an OBL equilibrium in which the fixed-point error is weighted according to the distribution induced by cross-play between the training policy π and a different fixed policy π_b, rather than by self-play of π. We investigate these methods in variants of Hanabi.
Supplementary Material: pdf