Learning to Safely Exploit a Non-Stationary Opponent

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submitted
Keywords: multi-agent learning, reinforcement learning, opponent modeling
Abstract: In dynamic multi-player games, an effective way to exploit an opponent's weaknesses is to build a perfectly accurate opponent model, which reduces the learning problem to a single-agent optimization that can be solved with standard reinforcement learning. However, naive behavior cloning may not suffice to train an exploiting policy, because an opponent's behavior is often non-stationary, adapting in response to other agents' strategies. On the other hand, overfitting to an opponent (i.e., exploiting only one specific type of opponent) makes the learning player easily exploitable by others. To address these problems, we propose a method named Exploit Policy-Space Opponent Model (EPSOM). In EPSOM, we model an opponent's non-stationarity as a sequence of transitions among different policies, and formulate this transition process with non-parametric Bayesian methods. To account for the trade-off between exploitation and exploitability, we train a player to learn a robust best response against the opponent's predicted strategy by solving a modified meta-game in the policy space. In this work, we consider a two-player zero-sum setting and evaluate EPSOM on Kuhn poker; the results suggest that our method can exploit its adaptive opponent while maintaining low exploitability (i.e., it achieves safe opponent exploitation). Furthermore, we show that our EPSOM agent performs strongly against unknown non-stationary opponents without further training.
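The abstract does not commit to a particular non-parametric Bayesian model, so the following is only a rough, self-contained sketch of how opponent policy switches could be inferred in this spirit: per-episode opponent action counts are clustered with a Chinese-Restaurant-Process mixture of Dirichlet-multinomial policies, and a change of cluster label signals a policy transition. The function name `crp_assign`, the hyperparameters `alpha` and `beta`, and the toy data stream are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def crp_assign(policy_counts, obs_counts, alpha=1.0, beta=0.5):
    """Assign one episode's opponent action counts to an inferred policy
    cluster (or open a new one) under a Chinese-Restaurant-Process prior
    with Dirichlet-multinomial predictive likelihoods. Illustrative only."""
    n_actions = len(obs_counts)
    scores = []
    for counts in policy_counts:
        n_k = counts.sum()
        predictive = (counts + beta) / (n_k + beta * n_actions)
        # CRP mass for an existing cluster is proportional to its size.
        scores.append(n_k * np.prod(predictive ** obs_counts))
    # A brand-new cluster gets mass alpha and a uniform predictive.
    uniform = np.full(n_actions, 1.0 / n_actions)
    scores.append(alpha * np.prod(uniform ** obs_counts))
    k = int(np.argmax(scores))            # MAP assignment, for brevity
    if k == len(policy_counts):           # opened a new policy cluster
        policy_counts.append(np.zeros(n_actions))
    policy_counts[k] += obs_counts
    return k

# Toy usage: an opponent whose action distribution switches mid-stream.
rng = np.random.default_rng(0)
stream = ([rng.multinomial(20, [0.7, 0.2, 0.1]) for _ in range(5)]
          + [rng.multinomial(20, [0.1, 0.3, 0.6]) for _ in range(5)])
policy_counts = []
print([crp_assign(policy_counts, obs) for obs in stream])
# A change of label marks a detected policy switch.
```

In a full pipeline along the lines the abstract describes, the inferred cluster label would index the predicted opponent policy against which a robust best response is then computed in the policy-space meta-game.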
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: pdf