Abstract: In recent years, deep reinforcement learning has achieved superhuman performance in zero-sum games such as Go and poker. Many real-world scenarios, however, are non-zero-sum, meaning that success requires cooperation and communication rather than competition. The game of Hanabi has been established as an ideal benchmark for agents learning to cooperate effectively with other agents and with humans. Bayesian action decoder methods perform well in the 2-player Hanabi setting, but a large gap remains between their scores and those of hat-coding strategies in the 3–5 player settings. The central problem is the conflict between exploring actions and exploiting the information conveyed by observed actions. We present a novel deep multi-agent reinforcement learning method, the Modified Action Decoder, which addresses this problem using the centralized training with decentralized execution paradigm. During training, agents observe not only the exploratory action selected by their teammates but also the teammates' optimal action, which allows better exploitation. We evaluate our method on Hanabi in the 2–5 player settings, where it outperforms previously published reinforcement learning methods and establishes a new state of the art.