
\subsection{Related Work}

Since our paper is theoretical in nature, we limit our discussion to prior work focused on theory.
%fits in this category.
%\pc{Our paper is related to the literature on reward-free, multi-agent, distributed, and deployment-efficient RL, as well as the literature on parallel and federated bandits.}

{\par \textbf{Multi-agent RL (MARL).}} %Our paper is an MARL problem. 
Although the applied MARL literature has been around for decades, theoretical work has gained prominence in recent years; we refer the reader to the recent surveys~\citep{zhang2020model,HernandezLeal2019Survey,Yang2021Survey}. 
%\pc{[Mention surveys and how they focus on different things.]}
Importantly, we highlight that a large body of %most 
recent work has focused on learning in
%however, we highlight a few points. Most MARL works have been either in \pc{the cooperative setting (where all agents either have identical rewards or must cooperatively maximize a joint expected reward)} or 
the two-player zero-sum Markov game setting, where one player tries to maximize the expected reward while the other tries to minimize it. One reason for the popularity of this setting is that it can be formulated as a minimax problem and its Nash equilibria are easily characterized~\citep{zhang2021multi}. Recent work covers both the tabular setting, e.g.,~\citep{kozuno2021learning,zhang2020model,Bai2020SelfPlay,Liu2021SharpSelfPlay,jin2022vlearning}, and the linear function approximation setting, e.g.,~\citep{chen2022almost,Cisneros-Velarde2022OnePolicyEnough,SQ-JY-ZW-ZY:22}. 
%The popularity of the linear approximation setting partly stems from the use of optimism in the face of uncertainty (as we will see below).%all cited works use it). 
In the case of general-sum Markov games, another large body of work has focused on providing guarantees for finding other solution concepts such as coarse correlated equilibria (CCE); e.g.,~\citep{Liu2021SharpSelfPlay,jin2022vlearning,Mao202REfficientRLGeneralSum}. Minimax sample optimality %in terms of sample efficiency 
has been shown, under certain assumptions, for finding CCE in general-sum games and Nash equilibria in zero-sum games in the tabular case~\citep{li2022minimaxoptimal}. 
%
%An increasingly popular framework for learning two player zero-sum games in both tabular and function approximation regimes are based on V-learning, e.g.~\citep{Bai2020SelfPlay,Liu2021SharpSelfPlay,jin2022vlearning}, because it leads to more efficient sample complexity: the action spaces are added instead of multiplied.  
%
For learning Nash equilibria, \cite{Liu2021SharpSelfPlay} proposed a Nash Q-learning algorithm for general-sum games in the tabular setting with an underlying episodic MG; it requires no extra conditions on the Nash equilibria.
%and is based on the principle of optimism and pessi.
%, which implicitly assumes the existence of pure Nash equilibria in the general-sum game and is based on the principle of optimism. 
%
%
While writing our paper, we found the recent preprint by~\cite{chengzhuo2022RLGSMG}, who studied representation learning in general-sum games %(model-based and model-free), 
and whose proposed algorithms 
%are also able to 
output a policy after a number of episodes.
%in order to achieve, for example, an $\epsilon$-Nash equilibrium. 
They focus on the harder problem of learning the feature vector of the linear approximation, whereas we assume that it is given; we focus only on learning a good policy, not on learning a good representation. 
%low-rank MDPs. In contrast, our paper assumes the representation is given and our function approximation is a subclass of low-rank MDPs -- thus we only care about learning a good policy and not on learning the representation, which inherently would add more sample complexity. 
Thus our guarantees are not directly comparable. 
%
Finally, we remark that both~\cite{Liu2021SharpSelfPlay} and~\cite{chengzhuo2022RLGSMG} use the principle of optimism and pessimism, so their algorithms compute two Q-functions, whereas ours computes a single optimistic Q-function. 
%
Two recent works~\citep{cui2023breaking,wang2023breaking} used function approximation and sought regret bounds that, when specialized to the tabular setting, avoid an exponential dependence on the size of the agents' action spaces. While our results have such a dependence when specialized to the tabular case, our setting is different from theirs. \cite{cui2023breaking} considered linear function approximation with each agent having its own feature vector encoding only its own action space, whereas we consider a feature vector that encodes the joint action space. Moreover, their work, unlike ours, restricted the underlying Markov game to be a potential Markov game when considering NE. The work by~\cite{wang2023breaking} also considered independent feature vectors and was only concerned with CCE and correlated equilibria (CE) as solution concepts. %It is also based on V-learning, whereas our algorithm is based on Nash Q-learning.
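
To illustrate the contrast between joint and per-agent feature vectors, the following sketch uses notation that we assume only for exposition; it need not match the notation of the cited works or of our later sections. Consider $n$ agents with state space $\mathcal{S}$ and action spaces $\mathcal{A}_1,\dots,\mathcal{A}_n$. A joint feature map and per-agent feature maps take the forms
\[
\phi:\mathcal{S}\times\mathcal{A}_1\times\dots\times\mathcal{A}_n\to\mathbb{R}^{d}
\qquad\text{and}\qquad
\phi_i:\mathcal{S}\times\mathcal{A}_i\to\mathbb{R}^{d_i},\quad i=1,\dots,n,
\]
respectively. If the tabular case is recovered through one-hot (indicator) features, then
\[
d=|\mathcal{S}|\prod_{i=1}^{n}|\mathcal{A}_i|
\qquad\text{whereas}\qquad
d_i=|\mathcal{S}|\,|\mathcal{A}_i|.
\]
Under this assumption, a bound that is polynomial in $d$ inherits the product $\prod_{i=1}^{n}|\mathcal{A}_i|$, which grows exponentially with the number of agents when the action spaces have comparable sizes, while a bound that is polynomial in $\max_i d_i$ need not.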
%
%However, to the best of our knowledge, no algorithm under the linear function approximation scheme --- inspired by the treatment of large or continuous state spaces ---- has been proposed. Our work fills this gap.
%
%However, although the first formal asymptotic analysis for general-sum games of Nash Q-learning was done two decades ago, 
%we find it
%%it is
%surprising that
%it has not been analyzed using modern theoretical RL (or MARL) tools in the context of general-sum games (a Nash Q-learning type algorithm has been proposed for two-player zero-sum games by~\cite{Bai2020SelfPlay}).
%%algorithms targeting (broader subclasses of) general-sum Markov games have not receive an analysis using more modern theoretical RL tools.
%We believe this is a good opportunity for our paper's contribution
%since Nash Q-learning may expand the interest in studying algorithms that may work for other solution types than the ones studied so far in general sum Markov games. 
% 
%%\pc{"Besides the Nash Q-learning work we just explained in the Introduction, we present here more works on ..."}
%%{\par \textbf{Learning in Markov games}} 
%%
%%We must mention that~\cite{Bai2020SelfPlay} proposes and analyzes a Nash Q-learning algorithm with optimism for the two-player zero-sum game in the tabular case -- a more general setting with more players and using function approximation has not been treated to the best of our knowledge.
%
%\pc{[Just commented the paragraph about policy methods that are non-RL.]}
%In the surveys cited above, we can find that there is another increasing literature in learning policies for Markov games in the setting where the dynamics are known and there is perfect access to the environment (so not an RL setting) -- consequently, there is no learning from samples. Unsurprisingly, as in MARL, here most of the works have been focused on two-player zero-sum games; e.g.~\citep{zhao2021provably}, and also on a restrictive class of Markov games called \emph{potential games} (being a direct example of it, cooperative games where all agents have the same reward). All the previously cited works only focus on policy gradient methods and in the tabular setting.
%
%\pc{[Then talk about how most of works are on zero-sum and cooperative in the theoretical side]}


%
%related to %fully 
%cooperative %MARL~\citep{boutilier1996planning}. 
%, where all agents try to maximize their joint reward --- in our case, 
% all agents collect trajectories that will be used to maximize a centralized value function. 
%
%In the online cooperative setting, \citet{agarwal2021communication} proposed a communication efficient algorithm for %the online learning of 
%tabular MDPs, 
%~\citet{zhang2019distributed, lin2019communication, suttle2020multi} studied distributed variants of the actor-critic algorithm~\citep{konda1999actor}, but did not explicitly characterize the benefits that arise from parallel exploration. 
%



{\par \textbf{Linear function approximation in RL.}} The idea of using linear function approximation is ubiquitous in theoretical RL. The first works to combine it with %employ its use alongside the introduction of 
optimism for sample-efficient learning in online RL were~\citet{CJ-ZY-ZW-MIJ:20, Yang2020RL}. 
%\pc{"The first owkr in combining LSVI with optimism for efficient learning was Jin, then, many works expanded such framework to include different situations within RL: two-player RL, efficient updating, parallel exploration, Stackelberg, etc.}
Since then, this framework has been adapted to different RL problems, such as representation learning (of the feature vector of the linear function approximation), e.g.,~\citep{agarwal2020flambe}; parallel learning (multiple agents learning in independent MDPs while being able to communicate their experience), e.g.,~\citep{dubey2021provably}; 
deployment efficiency (where the number of times a policy can be updated is restricted), e.g.,~\citep{MG-RX-SSD-LFY:21}; 
reward-free RL (where exploration and exploitation are separated into different learning stages), %separates the collection of trajectories from the learning of an optimal policy
e.g.,~\citep{RW-SD-LY-RS:20,AW-YC-MS-SSD-KJ:22}. 
%; Stackelberg games, e.g.,~\citep{Zhong2023Stackelberg}.
%
Some works have combined two of the aforementioned problems %within the framework of 
using 
linear function approximation; e.g., in the context of reward-free RL, \citet{huang2021deployment} studied deployment efficiency, whereas~\citet{Cisneros-Velarde2022OnePolicyEnough} studied the effect of parallel exploration. 
%
%{{\par \textbf{Deployment Efficient RL}}} Drawing inspiration from bandit learning with low switching costs~\citep{auer2002finite, cesa2013online}, a recent line of work studies efficient algorithms for RL when the number of times the exploration policy changes is restricted.~\citet{bai2019provably} provided a low switching cost $Q$ learning algorithm for tabular MDPs and~\citet{MG-RX-SSD-LFY:21} derived a provably efficient low switching cost algorithm for linear MDPs.
%\citet{matsushima2020deployment} proposed a model-based approach for deployment efficient learning.
%\citet{}
%
%We remark, again, that linear function approximation has been used for two-player zero-sum Markov games in MARL.
%
The algorithms in these works share a similar skeleton, since all of them use optimism and value iteration; it is within this framework that we propose an algorithm based on Nash Q-learning.
%
%
%
%
%
%
%{\par \textbf{Reward-free RL}} Reward-free RL studies the problem, first proposed in~\citet{jin2020reward}, in which agents try to explore the environment without knowing their individual reward functions. The setting focuses only on the exploration capabilities of different algorithms and serves as a level playing field when comparing different exploration strategies.~\citet{jin2020reward} studied reward-free exploration in tabular MDPs and~\citet{RW-SD-LY-RS:20} proposed a provably efficient reward-free exploration algorithm for linear MDPs.~\citet{SQ-JY-ZW-ZY:22} proposed a reward-free exploration algorithm when using a kernel function approximation that is efficient for both MDPs and two-player zero-sum MGs. More recently,~\citet{AW-YC-MS-SSD-KJ:22} provided a tighter reward-free exploration algorithm that matches the sample complexity of PAC RL for linear MDPs, while \citet{agarwal2020flambe} studied reward-free RL with unknown feature representation in low-rank MDPs.
%
%
%{\par \textbf{Concurrent RL}} 
%The works~\citep{YB-TX-NJ-YXW:19} and~\citep{ZZ-ZY-XJ:20} also analyze the setting where parallel agents can perform exploration with a single policy. Conditions for almost-linear speedup are provided, but their setting is different than ours: they only focus on the tabular case for online RL and do not study MGs. 
%%Lower bounds for parallel algorithms are not studied. 
%We remark, though, that they are more interested in the problem of decreasing %%lowering the switching cost, i.e., 
%the updating frequency of policies in RL.

The paper is organized as follows. In Section~\ref{sec:preliminaries}, we formally introduce the setting. In Section~\ref{sec:NashQ-analysis}, we introduce our Nash Q-learning algorithm and state our main result. In Section~\ref{sec:NashQproof}, we provide a sketch of the proof and discuss some nuances of its formal analysis. Section~\ref{sec:conclusion} concludes the paper.