Adversarial Policy Gradient for Alternating Markov Games


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Policy gradient reinforcement learning has been applied to two-player alternate-turn zero-sum games, e.g., in AlphaGo, self-play REINFORCE was used to improve the neural net model after supervised learning. In this paper, we emphasize that two-player zero-sum games with alternating turns, which have been previously formulated as Alternating Markov Games (AMGs), are different from standard MDP because of its two-agent nature. We exploit the difference in associated Bellman-equations, which leads to different policy iteration algorithms. In principle, as policy gradient method is a kind of generalized policy iteration, we show how these differences in policy iteration are reflected in policy gradient for AMGs. We reformulate an adversarial policy gradient and discuss potential possibilities for developing better policy gradient methods other than self-play REINFORCE. Specifically, the core idea is to estimate the minimum rather than the mean for the “critic”. We show by experimental results that our newly developed Monte-Carlo policy gradient methods are better than the former methods.