A reinforcement learning framework based on regret minimization for approximating best response in fictitious self-play
Abstract: Finding Nash equilibria in imperfect-information games is significant because many real-world games involve players with only partial observations. Neural Fictitious Self-Play (NFSP) combines Fictitious Self-Play (FSP), a popular game-theoretic model, with machine learning to approximate a Nash equilibrium without domain knowledge. In a two-player zero-sum game, when both partially informed agents update their strategies simultaneously, the environment is unstable and partially observable from each agent's perspective. On the one hand, this instability introduces error into the approximate best response produced by the reinforcement learning algorithm; on the other hand, each agent needs to explore the opponent's strategy more thoroughly. To address these problems, we propose a new reinforcement learning framework, Expected Average Regret Minimization (EARM), which approximates a better best response by estimating the expected average regret and applying different regret minimizers. Within this framework, we propose a new regret minimizer, Temperature Regret Matching (TRM), which flexibly controls the degree of exploration. We then propose a novel algorithm, EARM-FSP, which combines EARM with FSP and addresses the slow convergence of NFSP. EARM-FSP converges faster and reaches better local optima in imperfect-information games on OpenSpiel, e.g., Kuhn Poker, Leduc Poker, and Liar's Dice.
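For context, a plausible reading of a temperature-controlled regret matcher is a softmax over positive cumulative regrets whose temperature trades off exploration against exploitation. The sketch below is an assumption for illustration only, not the paper's exact TRM rule; the function name and the softmax formulation are hypothetical.

```python
import numpy as np

def temperature_regret_matching(cumulative_regrets, temperature=1.0):
    """Hypothetical sketch of a temperature-controlled regret matcher.

    Standard regret matching plays actions in proportion to their positive
    cumulative regrets. Here, a softmax with a temperature parameter is
    assumed instead: a high temperature flattens the strategy (more
    exploration), while a low temperature concentrates probability on the
    highest-regret actions (more exploitation). This is an illustrative
    assumption, not the paper's exact TRM formulation.
    """
    regrets = np.asarray(cumulative_regrets, dtype=float)
    positive = np.maximum(regrets, 0.0)
    if positive.sum() == 0.0:
        # No positive regret yet: fall back to a uniform strategy.
        return np.full(len(regrets), 1.0 / len(regrets))
    logits = positive / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: three actions with cumulative regrets [2.0, 0.5, -1.0].
# A lower temperature yields a sharper strategy than a higher one.
print(temperature_regret_matching([2.0, 0.5, -1.0], temperature=0.5))
print(temperature_regret_matching([2.0, 0.5, -1.0], temperature=5.0))
```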