- Keywords: Policy Optimization, Nash Equilibrium, Mahjong AI
- Abstract: Deep policy gradient methods have demonstrated promising results in many large-scale games, where the agent learns purely from its own experience. Yet policy gradient methods struggle to converge to a Nash Equilibrium (NE) in multi-agent settings. Counterfactual regret minimization (CFR) is guaranteed to converge to an NE in two-player zero-sum games, but it usually needs domain-specific abstraction techniques and model-based traversal to handle large-scale games. To inherit the merits of both methods, we extend the actor-critic framework in deep reinforcement learning to solve a large-scale two-player zero-sum imperfect-information game, 1v1 Mahjong, whose information set size and game length are much larger than those of Poker. In particular, we change the policy optimization objective from maximizing discounted returns to minimizing a type of weighted cumulative counterfactual regret. We achieve this by approximating the regrets with a deep neural network and minimizing them by generating self-play strategies with Hedge. We name the proposed algorithm Actor-Critic Hedge (ACH) and derive its theoretical connection to CFR. We prove the convergence of ACH to an NE under certain conditions. Experimental results on the proposed 1v1 Mahjong benchmark and on benchmarks from OpenSpiel demonstrate that ACH outperforms related state-of-the-art methods. In addition, the bot obtained by ACH defeats a human champion in 1v1 Mahjong.
- One-sentence Summary: A new actor-critic algorithm for approximating a Nash Equilibrium in the large-scale imperfect-information game 1v1 Mahjong.
- Supplementary Material: zip
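
The abstract relies on Hedge to turn regrets into a strategy. As a rough illustration only, the sketch below shows the standard tabular Hedge rule at a single decision point: action probabilities are proportional to the exponential of the cumulative regrets. It is not the ACH algorithm itself, which approximates regrets with a deep neural network. The function names `hedge_policy` and `update_regrets` and the step size `eta` are assumptions introduced for this example.

```python
import math


def hedge_policy(cumulative_regrets, eta=1.0):
    """Return Hedge probabilities, pi(a) proportional to exp(eta * R(a)).

    `eta` is a hypothetical step-size parameter chosen for illustration.
    """
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    m = max(cumulative_regrets)
    weights = [math.exp(eta * (r - m)) for r in cumulative_regrets]
    total = sum(weights)
    return [w / total for w in weights]


def update_regrets(cumulative_regrets, action_values, policy):
    """Add each action's instantaneous regret: its value minus the policy's expected value."""
    expected = sum(p * v for p, v in zip(policy, action_values))
    return [r + (v - expected) for r, v in zip(cumulative_regrets, action_values)]


if __name__ == "__main__":
    # Toy decision point with two actions and fixed (made-up) action values.
    regrets = [0.0, 0.0]
    for _ in range(5):
        policy = hedge_policy(regrets)
        regrets = update_regrets(regrets, action_values=[1.0, 0.0], policy=policy)
    print(hedge_policy(regrets))
```

With zero regrets the policy is uniform; as regret for an action accumulates, Hedge shifts probability toward it smoothly rather than switching abruptly.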