Efficient Policy Space Response Oracles

Ming Zhou; Jingxiao Chen; Ying Wen; Weinan Zhang; Yaodong Yang; Yong Yu; Jun Wang

22 Sept 2022 (modified: 12 Oct 2025) | ICLR 2023 Conference Withdrawn Submission | Readers: Everyone

Keywords: reinforcement learning, multi-agent reinforcement learning

Abstract: Policy Space Response Oracle (PSRO) methods provide a general approach to approximating Nash equilibria in two-player zero-sum games, but they suffer from two drawbacks: (1) computational inefficiency, due to repeated meta-game evaluation via simulation, and (2) exploration inefficiency, due to learning best responses against fixed meta-strategies. In this work, we propose Efficient PSRO (EPSRO), which considerably improves the efficiency of both steps. Central to our development is a novel subroutine of no-regret optimization for solving unrestricted-restricted (URR) games. By modeling EPSRO as URR game solving, one can compute the best responses and meta-strategies in a single forward pass without extra simulations. Theoretically, we prove that the proposed optimization procedures of EPSRO guarantee monotonic improvement in exploitability, a guarantee that is absent from existing research on PSRO. Furthermore, we prove that the no-regret optimization has a regret bound of $\mathcal{O}(\sqrt{T\log{[(k^2+k)/2]}})$, where $k$ is the size of the restricted policy set. The EPSRO pipeline is highly parallelized, which makes policy-space exploration more affordable in practice and thereby yields more behavioral diversity. Empirical evaluations on various games show that EPSRO achieves a 50x speedup in wall-clock time and 2.5x better data efficiency while obtaining exploitability comparable to that of existing PSRO methods.
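
The regret bound quoted in the abstract has the same shape as the classical bound for the Hedge (multiplicative weights) algorithm over N actions, with N = (k^2+k)/2. The sketch below is only an illustration of that shape and is not taken from the paper: the abstract does not say which no-regret algorithm EPSRO uses, so Hedge is an assumed stand-in, the random loss sequence is synthetic, and the names `num_actions`, `hedge_regret`, and `hedge_bound` are hypothetical. It assumes losses in [0, 1] and the textbook learning rate sqrt(8 ln N / T), for which the regret is at most sqrt(T ln N / 2).

```python
import math
import random


def num_actions(k):
    """Size of the action set appearing in the abstract's bound: (k^2 + k) / 2."""
    return (k * k + k) // 2


def hedge_regret(losses):
    """Run Hedge on a T x N table of losses in [0, 1] and return its regret.

    Regret is the learner's cumulative expected loss minus the cumulative
    loss of the best fixed action in hindsight.
    """
    steps = len(losses)
    n = len(losses[0])
    eta = math.sqrt(8.0 * math.log(n) / steps)  # textbook Hedge learning rate
    cum = [0.0] * n          # cumulative loss of each action so far
    learner_loss = 0.0
    for loss in losses:
        base = min(cum)      # shift exponents for numerical stability
        weights = [math.exp(-eta * (c - base)) for c in cum]
        total = sum(weights)
        # Expected loss of the mixed strategy proportional to the weights.
        learner_loss += sum(w / total * l for w, l in zip(weights, loss))
        cum = [c + l for c, l in zip(cum, loss)]
    return learner_loss - min(cum)


def hedge_bound(steps, n):
    """Classical Hedge regret bound sqrt(T ln N / 2) for losses in [0, 1]."""
    return math.sqrt(steps * math.log(n) / 2.0)


k = 4                        # hypothetical restricted policy set size
n = num_actions(k)           # (16 + 4) / 2 = 10 actions
T = 2000
rng = random.Random(0)
losses = [[rng.random() for _ in range(n)] for _ in range(T)]  # synthetic losses

regret = hedge_regret(losses)
bound = hedge_bound(T, n)
print(f"N = {n}, regret = {regret:.3f}, bound = {bound:.3f}")
```

Because the bound grows with ln N and N is quadratic in k, doubling the restricted policy set adds only a small additive term inside the logarithm, which is the practical reading of the O(sqrt(T log[(k^2+k)/2])) rate.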

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)

Supplementary Material: zip

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/efficient-policy-space-response-oracles/code)
