\section{Related Work}\label{sec:related}
\subsection{Bayesian optimization approach}
The most closely related line of research uses Bayesian optimization (BO) to solve games whose utilities are computationally expensive to evaluate. \cite{al2018approximating} proposed a method that finds equilibria of such games in a sequential decision-making framework using BO. Specifically, they introduced the \textit{game-theoretic regret} of a strategy profile $\x$, defined as the largest utility gain any agent $i$ can obtain by deviating from $x_i$ to any strategy in $\xxx_i$. The authors employ BO to minimize an approximation of the game-theoretic regret and thereby approximate a pure-strategy NE. They validate the method, in terms of game-theoretic regret, on a collection of synthetic games by comparison with several recent algorithms.
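
For concreteness, the regret notion described above can be sketched as follows. Here $u_i$ (the utility of agent $i$) and $x_{-i}$ (the strategies of all agents other than $i$) are notation assumed for illustration only and may differ from the notation of \cite{al2018approximating}:
\[
  \mathrm{regret}(\x) \;=\; \max_{i} \, \max_{x_i' \in \xxx_i} \bigl[\, u_i(x_i', x_{-i}) - u_i(x_i, x_{-i}) \,\bigr].
\]
Under this reading, and since $x_i' = x_i$ is an admissible deviation, the regret is nonnegative and equals zero exactly when no agent can gain by deviating unilaterally, that is, when $\x$ is a pure-strategy NE.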

\cite{picheny2019bayesian} studied the same problem of solving games with a Gaussian process (GP)-based approach. The main difference from \cite{al2018approximating} lies in the acquisition function used by BO. Instead of minimizing the game-theoretic regret, \cite{picheny2019bayesian} proposed two acquisition functions: one maximizes the probability of achieving the equilibrium, while the other reduces an uncertainty measure related to the equilibrium as quickly as possible.

\reviseFx{\cite{marchesi2020learning} proposed a multi-armed bandit algorithm built on Gaussian processes and provided theoretical justification for it. Our work differs in two respects. First, \cite{marchesi2020learning} focused on two-player zero-sum games, while our work allows multi-player normal-form games. Second, the regret bound in \cite{marchesi2020learning} has a suboptimality gap in its denominator. As discussed by \cite{lattimore2020bandit}, the major problem with this dependency is that the gap can be arbitrarily small in practice, which limits the practical value of the resulting regret analysis. In contrast, our regret bound depends on the maximum mutual information of the GP and is gap-independent.}


Recently, \cite{aprem2021bayesian} studied a specific class of games, termed potential games \citep{monderer1996potential}.
Specifically, they exploited the structure of potential games and proposed modeling the potential function directly with a Gaussian process, rather than modeling the utility functions as in \cite{picheny2019bayesian}.

Compared to this previous work, our key contribution is a novel GP objective for NE learning. Furthermore, we present a no-regret learning algorithm that guarantees convergence to NE, addressing a gap in the existing literature, which lacks a theoretical convergence analysis for similar approaches.

\subsection{Other online learning algorithms} 
Learning Nash equilibria has been widely studied in the literature. Regret minimization is a closely related category of learning rules. In essence, an agent incurs ex-post regret if, over some period, they could have achieved a higher average payoff by choosing a different strategy. Several simple learning procedures aim to minimize ex-post regret \citep{foster1999regret, hart2000simple, hart2001general,NEURIPS2019_68521755}. However, ex-post regret minimization rules do not guarantee that behavior converges to a Nash equilibrium. Rather, these rules cause the empirical frequency distribution of play to converge to the set of correlated equilibria, which contains the Nash equilibria but is frequently much larger and not necessarily more desirable in terms of strategic outcomes.

Another relevant learning rule is regret testing \citep{foster2006regret}. Here, an agent compares their average per-period payoff over an extended sequence of plays with the average payoff obtained through occasional experiments with alternative strategies. \cite{foster2006regret} demonstrated that, for all finite two-person games, this rule approximates Nash equilibrium behavior most of the time. \cite{germano2014global} later established that a modification of this procedure comes close to Nash equilibrium behavior in any finite $n$-person game with generic payoffs.


A less closely related set of learning rules is based on interactive trial-and-error learning \citep{karandikar1998evolving,young2009learning,marden2009payoff}. In this context, an agent learns by occasionally experimenting with new strategies and discarding choices that fail to yield higher payoffs. These rules can approach pure Nash equilibrium play for a high proportion of the learning period, but they typically do not converge.

\reviseFx{Recently, \cite{gemp2023approximating} proposed a novel loss function for Nash equilibrium learning in general games that is amenable to Monte Carlo estimation and allows stochastic gradient descent (SGD) to be applied for efficient optimization. Although their approach and ours tackle a similar problem from different perspectives, combining a gradient-based optimizer and a Monte Carlo estimator with a GP-based bandit algorithm has drawn interest in the BO literature \citep{balandat2020botorch} and suggests an interesting future direction.}

Similar to the work by \cite{aprem2021bayesian},
\cite{chapman2013convergent} also studied convergence to Nash equilibria in potential games with rewards that are initially unknown. In contrast to the Bayesian optimization approach, they proposed a multi-agent version of Q-learning that estimates the reward functions using novel forms of the $\epsilon$-greedy learning policy. \cite{jordan1991bayesian} studied Bayesian learning of equilibrium, assuming each agent knows their own utility information but not that of the other agents. Our work is also related to learning other equilibrium concepts in game theory and to Bayesian optimization with multiple structured utility functions; we refer to Appendix \ref{sec:additional_related} for more detailed discussions and comparisons.

