
\section{Introduction}
\label{sec:introduction}

Multi-agent reinforcement learning (MARL) has been successfully applied to a diverse range of problems, such as playing the games of Go~\citep{alphaGo:16, alphaZero:17} and StarCraft~\citep{vinyals2019grandmaster}, coordinating unmanned aerial vehicles~\citep{pham2018cooperative}, autonomous driving~\citep{Dinneweth2022SurveyCar}, power systems~\citep{foruzan18Microgrid}, and managing water and energy resources~\citep{Ni2014WaterResource,Lingxiao2020Energy}. The theory and development of %On the theoretical side, 
multi-agent reinforcement learning algorithms constitute a prolific research area, as attested by various recent surveys of the field, e.g.,~\citep{zhang2021multi,HernandezLeal2019Survey,Yang2021Survey}. In general, employing MARL to solve for a Nash equilibrium of a general-sum Markov game is computationally hard~\citep{Daskalakis2009Complexity}. This has motivated theoretical works either to look for weaker solution concepts (e.g., coarse correlated equilibria) or, when looking for a Nash equilibrium, to either: (i) leave the general-sum domain and focus on zero-sum or fully cooperative games, or (ii) impose extra conditions on the underlying general-sum Markov game (MG)~\citep{zhang2021multi}. The seminal work~\citep{Hu2003NashQ} introduced the \emph{Nash Q-learning} algorithm in the context of infinite-horizon discounted Markov games. The idea of Nash Q-learning %the algorithm 
is that, at every time step, each agent finds a Nash equilibrium of a static game whose utilities (rewards) are defined by the (estimates of the) Q-functions of all the agents; this static game is called a \emph{stage game}.
%
Thus, a motivation for using Nash Q-learning is its algorithmic simplicity: it solves a static game where Q-learning (for classic single-agent RL) would otherwise solve for an optimum.
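As a sketch in illustrative notation (the symbols below are assumptions made for exposition and need not match the notation used elsewhere in the paper), the update performed by agent $i$ in Nash Q-learning~\citep{Hu2003NashQ}, after observing a transition from state $s$ to state $s'$ under the joint action $a=(a^1,\dots,a^n)$, can be written as
\[
Q^i_{t+1}(s,a) = (1-\alpha_t)\,Q^i_t(s,a) + \alpha_t\bigl[\,r^i_t + \gamma\,\mathrm{Nash}Q^i_t(s')\,\bigr],
\]
where $\alpha_t$ is a learning rate, $\gamma$ is the discount factor, $r^i_t$ is the reward received by agent $i$, and $\mathrm{Nash}Q^i_t(s')$ is the payoff of agent $i$ under the selected Nash equilibrium of the stage game with payoffs $(Q^1_t(s',\cdot),\dots,Q^n_t(s',\cdot))$. Single-agent Q-learning has the same form, with $\mathrm{Nash}Q^i_t(s')$ replaced by $\max_{a'} Q_t(s',a')$.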
%
In~\citep{Hu2003NashQ}, asymptotic learning guarantees are given when the Nash equilibrium chosen in every stage game is consistently of the same type, either a \emph{global optimal} point or a \emph{saddle} point. %Then, a strong sufficient condition for Nash Q-learning is to ensure that at \emph{every} time-step, we can always find a Nash equilibria of these two types. 
%\pc{Note the similarity with Q-learning in traditional single-agent RL, where at each time-step, the single agebt takes an action that maximizes its estimated Q-function.} 
Despite this strong sufficient condition,~\cite{Hu2003NashQ} presented numerical examples where Nash Q-learning solves games that do not satisfy it.
It is important to remark that there exist proven cases in which value-based methods, Nash Q-learning included, cannot converge to a single Nash equilibrium of some general-sum Markov games~\citep{Zinkevich2005CyclicEquilibria}. Nevertheless, Nash Q-learning remains one of the few MARL algorithms for general-sum games, and it has inspired the development of algorithms specialized to other classes of Markov games or focused on other solution concepts. Moreover, it is still consistently cited in the applied literature~\citep{HernandezLeal2019Survey}.

The first 
%original and -- to the best of our knowledge -- only 
formal analysis of Nash Q-learning, by~\cite{Hu2003NashQ}, provided only asymptotic convergence guarantees in the tabular setting. About two decades later,~\cite{Liu2021SharpSelfPlay} proposed a variant of the Nash Q-learning algorithm and used a modern approach %analysis tools 
from the theoretical reinforcement learning (RL) literature to establish finite-sample guarantees, and thus the sample efficiency of learning, in the tabular setting. The analysis of~\cite{Liu2021SharpSelfPlay} uses regret as the performance metric, so the object of interest is that the average performance of the policies approaches the performance of a Nash equilibrium, rather than actual convergence to a single equilibrium.

In the modern RL literature, it is known that tabular approaches are not ideal in environments where the state space is large or continuous. This has motivated the development of \emph{linear function approximation}, where, for example, the transition kernel and the reward function of the underlying Markov decision process (MDP) are linear functions of a feature vector~\citep{CJ-ZY-ZW-MIJ:20, Yang2020RL}.
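As a sketch in the single-agent setting of~\cite{CJ-ZY-ZW-MIJ:20} (the symbols here are illustrative assumptions and need not match the notation used elsewhere in the paper), given a known $d$-dimensional feature vector $\phi(s,a)$, this linearity assumption reads
\[
P_h(s' \mid s,a) = \langle \phi(s,a), \mu_h(s') \rangle,
\qquad
r_h(s,a) = \langle \phi(s,a), \theta_h \rangle,
\]
where $h$ indexes the step within an episode and $\mu_h$ and $\theta_h$ are unknown. In a Markov game, $a$ would presumably be read as the joint action of all agents and each agent would have its own reward parameter; this reading is an assumption made only for illustration.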

Taken together, these prior works motivate the central question of our paper:  

\emph{Can we obtain finite-sample guarantees and sample efficiency for Nash Q-learning in the linear function approximation regime?} %Finite-sample guarantees \emph{provide} finite-convergence guarantees in the sense that in the context of MARL (and RL), finite-sample guarantees implies exploration of the environment for a finite time.

We answer this question positively by proposing a Nash Q-learning algorithm -- called \emph{\algname} (\algbrev) -- and providing finite-sample guarantees for it under a regret performance metric. Interestingly, we find that the sample efficiency of our algorithm nearly matches that reported in~\citep{CJ-ZY-ZW-MIJ:20} for (single-agent) RL in the same approximation regime.

%
%Answering this question is important given the relevance of Nash Q-learning in the literature of MARL and being one of the few algorithms that attempts to work under general-sum Markov games for finding a relatively strong solution concept -- a pure Nash equilibrium. 

In general, our central question is also motivated by the increasing number of works that have provided sample-efficiency guarantees for (single-agent) RL problems in recent years.
%Due to the interest in dealing with large state spaces (possibly continuous), a great amount of work has moved from the tabular setting to the linear function approximation setting. Starting from 
The works~\citep{CJ-ZY-ZW-MIJ:20, Yang2020RL} began providing such guarantees in the linear function approximation domain using the principle of \emph{optimism} under uncertainty for \emph{online} RL. Later works have applied this principle to \emph{reward-free} RL (e.g.,~\citep{RW-SD-LY-RS:20}) and have even applied a counterpart principle, called \emph{pessimism}, to \emph{offline} RL~\citep{Jin2021Pevi}. %A version of 
%Optimism and pessimism has been used together in MARL problems related to two-player zero sum games~\citep{SQ-JY-ZW-ZY:22}. 
Optimism consists of adding a bonus so that the estimated optimistic Q-function assigns higher values to state-action pairs that have been explored less. %, thus motivating their further exploration. 
Pessimism does the opposite by subtracting a bonus. However, when it comes to (online) MARL, to the best of our knowledge, the simultaneous application of optimism and pessimism to achieve sample efficiency for learning Nash equilibria has mainly been limited to two-player zero-sum games in the linear function approximation case~\citep{SQ-JY-ZW-ZY:22}, and to general-sum games in the tabular case~\citep{Liu2021SharpSelfPlay}. In this work, we show that the principle of optimism can easily be applied to Nash Q-learning in general-sum games.
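To make the role of the bonus concrete, the following is a sketch of the optimistic estimate used in the single-agent algorithm of~\cite{CJ-ZY-ZW-MIJ:20}; the symbols are illustrative assumptions and are not necessarily those of our algorithm. With $\phi(s,a)$ the feature vector, $w_h$ a least-squares weight estimate at step $h$, and $\beta>0$ and $\lambda>0$ tuning parameters, the estimate in episode $k$ is
\[
Q_h(s,a) = \min\Bigl\{\, w_h^\top \phi(s,a) + \beta \sqrt{\phi(s,a)^\top \Lambda_h^{-1} \phi(s,a)},\; H \Bigr\},
\qquad
\Lambda_h = \lambda I + \sum_{\tau=1}^{k-1} \phi(s_h^\tau,a_h^\tau)\,\phi(s_h^\tau,a_h^\tau)^\top .
\]
The square-root term is large in feature directions that have been visited rarely and shrinks as data accumulate in those directions, which is what drives exploration. A pessimistic estimate subtracts a bonus of the same kind instead of adding it.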
%and hopefully motivate the analysis of more complex algorithms -- which perhaps look for weaker solutions --- in the domain of MARL. 
%
%We remark that optimism has also been applied to \emph{reward-free} RL (e.g.~\citep{RW-SD-LY-RS:20}),{%~\citep{jin2020reward}),
%besides the aforementioned classical \emph{online} RL setting; and a counterpart principle, called \emph{pessimism}, has been applied in \emph{offline} RL~\citep{Jin2021Pevi}.       
%
%For example, \cite{nair2015Massively} synchronized the value function estimates stored on each agent and used the same value function to guide subsequent exploration. As a result of having multiple agents collecting data simultaneously and exploring the environment, parallel methods are faster than their single agent or sequential counterparts. They are able to learn near-optimal policies in a relatively short amount of time --- only $K$ rounds are needed to collect and use a total of $KP$ trajectories.  
%
%\cite{dimakopoulou2018coordinated} proposed multiple sampling-based algorithms that provide agents with a diverse set of exploration policies. \cite{mahajan2019maven} proposed the use of mutual information to ensure that agents explore with a diverse set of policies. At a high level, these alternatives ensure that the agents' exploration policies are sufficiently different from one another, and argue that the diversity speeds up the learning performance.
%
%Taken together, existing theory and practice motivate a central question: \emph{is diversity of exploration (by different policies) always required for efficient parallel exploration in RL?} By \emph{efficient exploration} we refer to an exploration that results in a speedup of the learning performance. 

%
%We consider our central question in the MARL setting with an underlying \emph{episodic} (or finite-horizon) Markov game, thus complementing the existing asymptotic analysis in \emph{discounted} (or infinite-horizon) Markov games in~\citep{Hu2003NashQ}. 
%
%
%We believe our finite-sample results for Nash Q-learning could lead to advances in theoretical MARL and lead to the application of the techniques being used --- now almost standard in theoretical RL -- to other problems in MARL that could go beyond just zero-sum or fully cooperative games and/or which can find weaker solutions for the game.
%
%\pc{"Our results suggest..."}
%//

\paragraph{Contributions} We summarize our contributions.
\begin{itemize}
    \item We provide the first sample-efficiency guarantees for a Nash Q-learning algorithm in the linear function approximation regime for general-sum games, obtaining a regret bound of $\tilde{\cO}(\sqrt{Kd^3H^5})$, where $K$ is the number of episodes, $H$ is the episode length, and $d$ is the dimension of the feature vector of the linear function approximation.
    %
    %\item To prove our regret, we propose the \algname~(\algbrev) algorithm, making use of the principle of optimism.
    %%, in the episodic Markov game setting, and perform its analysis based on a regret metric. 
    %%We emphasize that the original Nash Q-learning was proposed in the context of tabular MARL and discounted Markov games. 
    %Interestingly, our sufficient conditions for sample efficiency matches the conditions by~\cite{Hu2003NashQ}, even though their analysis is of a different nature -- more on  Section~\ref{sec:NashQproof}. 
    %
    \item To prove our guarantees, we propose the \algname~(\algbrev) algorithm.
    %, making use of the principle of optimism.
    %, in the episodic Markov game setting, and perform its analysis based on a regret metric. 
    %We emphasize that 
    The original Nash Q-learning proposed by~\cite{Hu2003NashQ} was set in the context of tabular and discounted MGs, and considered convergence to a Nash equilibrium as the performance metric. In contrast, we consider episodic MGs with regret as the performance metric, and do not need the existence of special Nash equilibria in the stage games as in~\cite{Hu2003NashQ}.
    %
    \item When specialized directly to the tabular case, our regret bound has a polynomial gap in all factors except the number of episodes $K$, compared to the best-known result by~\cite{Liu2021SharpSelfPlay}.
    %
    \item 
    %Similar to how Nash Q-learning resembles standard Q-learning, and its analysis are somehow related (by the use of contraction theory~\citep{Hu2003NashQ} and each agent taking a decision depending on the value of the estimated Q-function); our algorithm and its analysis is related to the one proposed by~\citep{CJ-ZY-ZW-MIJ:20}.
    In the single-agent case, our \algbrev~algorithm reduces to the model-free RL algorithm proposed by~\cite{CJ-ZY-ZW-MIJ:20}: instead of computing a (mixed) Nash equilibrium of each stage game, the agent takes the optimal greedy action. Remarkably, we show that our algorithm's sample efficiency differs only by a factor of $H$ -- the length of the episode -- from that of the single-agent algorithm. To the best of our knowledge, this is the first time a general-sum MARL algorithm nearly matches the sample efficiency of an RL algorithm.
    %
    %\item Using regret as a performance metric, we prove it is bounded by \pc{$O(XXX)$}, thus obtaining the same efficiency as reported for the (single-agent) online RL case in~\citep{CJ-ZY-ZW-MIJ:20}.  
    %This is, to the best of our knowledge, the first time a general-sum MARL algorithm has been proven to achieve same sample efficiency as an RL algorithm. We provide 
    %%\pc{we make use of LSVI plus optimism. Even though it has always been adapted to different settings in RL (single agent), (?RL?.}
    %    
\end{itemize}

