- Keywords: zero-sum Markov games, policy gradient, actor-critic, temporal difference
- Abstract: We introduce algorithms based on natural policy gradient and two time-scale natural actor-critic, and analyze their sample complexity for solving two-player zero-sum Markov games in the tabular case. Our results improve the best-known sample complexities of policy gradient/actor-critic methods for convergence to Nash equilibrium in the multi-agent setting. We use the error propagation scheme from approximate dynamic programming, recent advances in the global convergence of policy gradient methods, temporal difference learning, and techniques from the stochastic primal-dual optimization literature. Our algorithms feature two stages, requiring agents to agree on an etiquette before starting their interactions, which is feasible, for instance, in self-play. On the other hand, the agents only have access to the joint reward and joint next state, not to each other's actions or policies. Our sample complexities also match the best-known results for global convergence of policy gradient and two time-scale actor-critic algorithms in the single-agent setting. We provide numerical verification of our method on a two-player bandit environment and a two-player game, Alesia. We observe improved empirical performance compared to the recently proposed optimistic gradient descent ascent variant for Markov games.
- One-sentence Summary: We improve the sample complexity of actor-critic algorithms for solving zero-sum Markov games.
- Supplementary Material: zip