Abstract: Policy space response oracles (PSRO) is a promising tool for finding an approximate Nash equilibrium (NE) in two-player zero-sum games. It approximates the equilibrium by iteratively expanding a small-scale meta-game formed by a restricted strategy population consisting of historical approximate best responses to the meta-games. However, since these best responses are strongly correlated with each other, existing PSRO and its variants often suffer from slow diversity growth of the strategy population, and hence from poor exploration efficiency and a slow convergence rate. To address this problem, this article proposes Purified PSRO, which deliberately maintains a pure strategy population formed by the pure-strategy bases of approximate best responses. A novel module, namely non-best-response suppression (NBRS), is introduced to compute, at each epoch, a pure-strategy base with better orthogonality for expanding the strategy population. In this way, Purified PSRO quickly increases the diversity of the strategy population, thus greatly enhancing the efficiency of exploration. Theoretically, we prove the convergence of Purified PSRO. Moreover, we introduce an early-stop module to reduce computation cost, and we give an upper bound on the exploitability when the algorithm stops early. Extensive experiments on random games of skill (RGoS) and real-world meta-games show that Purified PSRO consistently outperforms existing SOTA methods, sometimes by a large margin.
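To make the iterative expansion the abstract describes concrete, below is a minimal sketch of a generic PSRO/double-oracle loop on a zero-sum matrix game. The function names (`solve_meta_nash`, `psro`) and the fictitious-play meta-solver are illustrative assumptions, not the paper's implementation; Purified PSRO's NBRS module would replace the plain best-response oracle with a purified pure-strategy base.

```python
import numpy as np

def solve_meta_nash(M, iters=2000):
    """Approximate the NE of the restricted zero-sum meta-game M (row payoffs)
    via fictitious play; an illustrative meta-solver, not the paper's choice."""
    n_row, n_col = M.shape
    counts_row, counts_col = np.ones(n_row), np.ones(n_col)
    for _ in range(iters):
        # each player best-responds to the opponent's empirical mixture
        counts_row[np.argmax(M @ (counts_col / counts_col.sum()))] += 1
        counts_col[np.argmin((counts_row / counts_row.sum()) @ M)] += 1
    return counts_row / counts_row.sum(), counts_col / counts_col.sum()

def psro(G, epochs=20):
    """Double-oracle / PSRO skeleton on a zero-sum matrix game G (row payoffs).
    Purified PSRO would purify the oracle's output (NBRS) before adding it."""
    pop_row, pop_col = [0], [0]  # strategy populations as pure-action indices
    for _ in range(epochs):
        M = G[np.ix_(pop_row, pop_col)]      # restricted meta-game
        p, q = solve_meta_nash(M)            # meta-NE over the populations
        # oracle step: exact best responses to the meta-NE mixtures
        br_row = int(np.argmax(G[:, pop_col] @ q))
        br_col = int(np.argmin(p @ G[pop_row, :]))
        if br_row in pop_row and br_col in pop_col:
            break                            # population closed under BR
        if br_row not in pop_row:
            pop_row.append(br_row)
        if br_col not in pop_col:
            pop_col.append(br_col)
    return pop_row, pop_col

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))            # stand-in for a random game of skill
print(psro(G))
```

In this toy matrix setting the exact best response is already pure; the paper's contribution concerns the approximate, correlated best responses arising in large games, where extracting a pure-strategy base with better orthogonality is nontrivial.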