K-level Reasoning for Zero-Shot Coordination in HanabiDownload PDF

21 May 2021, 20:51 (modified: 31 Jan 2022, 06:31)NeurIPS 2021 PosterReaders: Everyone
Keywords: Multi-Agent Reinforcement Learning, Reinforcement Learning, Zero-Shot Coordination, Deep Reinforcement Learning
TL;DR: With a simple engineering optimization, jointly training all levels of a K-Level Reasoning Hierarchy, we are able to stabilize and improve Zero-Shot Coordination results in Hanabi.
Abstract: The standard problem setting in cooperative multi-agent settings is \emph{self-play} (SP), where the goal is to train a \emph{team} of agents that works well together. However, optimal SP policies commonly contain arbitrary conventions (``handshakes'') and are not compatible with other, independently trained agents or humans. This latter desiderata was recently formalized by \cite{Hu2020-OtherPlay} as the \emph{zero-shot coordination} (ZSC) setting and partially addressed with their \emph{Other-Play} (OP) algorithm, which showed improved ZSC and human-AI performance in the card game Hanabi. OP assumes access to the symmetries of the environment and prevents agents from breaking these in a mutually \emph{incompatible} way during training. However, as the authors point out, discovering symmetries for a given environment is a computationally hard problem. Instead, we show that through a simple adaption of k-level reasoning (KLR) \cite{Costa-Gomes2006-K-level}, synchronously training all levels, we can obtain competitive ZSC and ad-hoc teamplay performance in Hanabi, including when paired with a human-like proxy bot. We also introduce a new method, synchronous-k-level reasoning with a best response (SyKLRBR), which further improves performance on our synchronous KLR by co-training a best response.
Supplementary Material: pdf
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
21 Replies

Loading