
\section{Nash Q-learning and its Analysis}
\label{sec:NashQ-analysis}

We propose a simple Nash Q-learning algorithm based on linear function approximation and optimism, named \emph{\algname} (\algbrev) and described in Algorithm~\ref{alg:main_LIN_UCB_LSVI}.
%
%\pc{"They solve a static game which we call a \emph{stage game}"}

We now outline the \algbrev ~algorithm. At each episode $k\in[K]$ and step $h\in[H]$, the state-action pairs visited by the agents at step $h$ during the previous $k-1$ episodes are summarized in a covariance matrix $\Lambda^k_h$ (line 6 of Algorithm~\ref{alg:main_LIN_UCB_LSVI}). Then, all the agents participate in a static game defined by previously computed optimistic estimates of the Q-functions, and a Nash equilibrium is computed (line 7). This static game is called a \emph{stage game} because it is solved in every episode and time step (and depends on the current state of the Markov game). Next, each agent, using its Nash policy computed from the stage game, computes a new \emph{optimistic} estimate of its Q-function (line 10), using the optimism bonus $\beta(\phi(\cdot,\cdot)^\top(\Lambda^k_h)^{-1}\phi(\cdot,\cdot))^{1/2}$. Finally, all the agents jointly explore the environment (lines 14--16) by taking actions drawn from their respective policies computed from the stage games. The resulting state-action trajectory is then collected and the whole process repeats in the next episode.
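
For intuition, the weight update in line 9 of Algorithm~\ref{alg:main_LIN_UCB_LSVI} can be read as the closed-form solution of a ridge regression problem, as is standard in least-squares value iteration. As a sketch, write $y^{i,\tau}_h := r_h^i(x_h^\tau,a_h^\tau)+\E_{a\sim\pi^*}[Q^{i,k}_{h+1}(x_{h+1}^{\tau},a)]$ for the regression target; the symbol $y^{i,\tau}_h$ is shorthand introduced only for this display and is not used elsewhere. Then, with $w$ ranging over $d$-dimensional vectors,
\[
w_h^{i,k} = \arg\min_{w} \; \sum_{\tau=1}^{k-1}\big(y^{i,\tau}_h - w^\top\phi(x_h^\tau,a_h^\tau)\big)^2 + \lambda\|w\|_2^2 .
\]
Setting the gradient of this objective to zero gives $\big(\lambda I_d+\sum_{\tau=1}^{k-1}\phi(x_h^\tau,a_h^\tau)\phi(x_h^\tau,a_h^\tau)^\top\big)w=\sum_{\tau=1}^{k-1}\phi(x_h^\tau,a_h^\tau)\,y^{i,\tau}_h$, that is, $w_h^{i,k}=(\Lambda_h^k)^{-1}\sum_{\tau=1}^{k-1}\phi(x_h^\tau,a_h^\tau)\,y^{i,\tau}_h$, which is the expression used in the algorithm.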

\begin{remark}[Computational aspects] %Algorithm~\ref{alg:main_LIN_UCB_LSVI} requires the computation of a Nash equilibrium (NE) for the static game in line 14.
%
Though a (mixed) Nash equilibrium (NE) is always guaranteed to exist for the static games defined by the optimistic Q-functions in lines 7 and 14 of Algorithm~\ref{alg:main_LIN_UCB_LSVI}, solving for an (exact) NE is in general computationally intractable~\citep{Chen2009SettComplNE,Daskalakis2009Complexity}.
%since finding an exact NE is PPAD-complete. Additionally, we will cite the following two relevant works:
\end{remark}

%{\par \textbf{About our assumptions on information access in \algbrev.}} 
\begin{remark}[About %our assumptions on 
information access in \algbrev]
In this paper, we are primarily concerned with analyzing \algbrev ~as a solver for the policies of the underlying Markov game, e.g., as done in the recent work~\citep{Liu2021SharpSelfPlay}. One could think of relaxing some implementation details, such as the information each agent has access to across episodes, but this is beyond the scope of the paper. For example, at each step $h\in[H]$ and iteration $k\in[K]$, 
one could keep the optimistic Q-function $Q_h^{i,k}$ of every agent $i\in[n]$ private, i.e., hidden from the rest of the agents. Then,
%we assume full access to the optimistic Q-functions of every agent ($\{Q_h^{i,k}\}_{i\in[n]}$ in lines 7 and 14). If we were interested in making the these Q-functions private information for each agent, we
%one 
%could let 
each agent would try to estimate the Q-functions of the rest of the agents based on the observation of %historical 
past rewards -- an idea already outlined in~\citep{Hu2003NashQ}. 
\end{remark}
 

\begin{algorithm}[t!]%[tb]
  \caption{\algname~ (\algbrev)}
  \label{alg:main_LIN_UCB_LSVI}
\begin{algorithmic}[1]
    \STATE {\bfseries Input:} $K$, $\beta$, $\lambda$
    \FOR{episode $k\in[K]$}
        % \STATE Receive initial state $s_0$
        \STATE $x_1^{k}\gets s_0$ 
        %\STATE \# DONE BY CENTRAL SERVER:
         \STATE $Q_{H+1}^{i,k}(\cdot,\cdot)\gets 0$, $i\in[n]$
         %\STATE Share $Q^{i,k}_{h+1}(\cdot,\cdot)$
        \FOR{$h=H,\dots,1$}
            \STATE $\Lambda_{h}^k\gets\lambda I_d + \sum^{k-1}_{\tau=1}\phi(x_h^{\tau},a_h^{\tau})\phi(x_h^{\tau},a_h^{\tau})^\top$
            %\STATE $\nu^*\gets$ an \emph{optimal} or \emph{saddle} Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h+1}(x^\tau_{h+1},\cdot),\dots,Q^{n,k}_{h+1}(x^\tau_{h+1},\cdot))$
            %\STATE $a^*\gets$ a (pure) Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h+1}(x^\tau_{h+1},\cdot),\dots,Q^{n,k}_{h+1}(x^\tau_{h+1},\cdot))$
            \STATE $\pi^*\gets$ a  Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h+1}(x^k_{h+1},\cdot),\dots,Q^{n,k}_{h+1}(x^k_{h+1},\cdot))$
            \FOR{$i\in[n]$}
            %\STATE $w_{h}^{i,k}\gets(\Lambda_{h}^{k})^{-1} \sum^P_{p=1}\sum^{k-1}_{\tau=1}\phi(x_h^{\tau},a_h^{\tau})$ $[r_h^i(x_h^\tau,a_h^\tau)+\E_{a\sim\nu^*}[Q^{i,k}_{h+1}(x_{h+1}^{\tau},a)]]$
            %\STATE $w_{h}^{i,k}\gets(\Lambda_{h}^{k})^{-1} \sum^P_{p=1}\sum^{k-1}_{\tau=1}\phi(x_h^{\tau},a_h^{\tau})$ $[r_h^i(x_h^\tau,a_h^\tau)+Q^{i,k}_{h+1}(x_{h+1}^{\tau},a^*)]$
            \STATE $w_{h}^{i,k}\gets(\Lambda_{h}^{k})^{-1} \sum^{k-1}_{\tau=1}\phi(x_h^{\tau},a_h^{\tau})$ $[r_h^i(x_h^\tau,a_h^\tau)+\E_{a\sim\pi^*}[Q^{i,k}_{h+1}(x_{h+1}^{\tau},a)]]$
            \STATE $Q^{i,k}_{h}(\cdot,\cdot)\gets \min\{(w_{h}^{i,k})^\top\phi(\cdot,\cdot)+\beta(\phi(\cdot,\cdot)^\top(\Lambda_{h}^{k})^{-1}\phi(\cdot,\cdot))^{1/2},H\}$
            \ENDFOR 
        \ENDFOR
        %\STATE \# DONE BY EACH AGENT $p\in[P]$ IN PARALLEL:
        %\FOR{$p\in[P]$} %Not sure if put the p as in outer loop
        \FOR{$h\in[H]$}
            %\STATE $a^k_h\gets$ an \emph{optimal} or \emph{saddle} \pc{(pure)} Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h}(x^k_{h},\cdot),\dots,Q^{n,k}_{h}(x^k_{h},\cdot))$
            %\STATE $a^k_h\in$  \pc{(pure)} Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h}(x^k_{h},\cdot),\dots,Q^{n,k}_{h}(x^k_{h},\cdot))$
            \STATE $\pi^k_h(x^k_h)\gets$ a Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h}(x^k_{h},\cdot),\dots,Q^{n,k}_{h}(x^k_{h},\cdot))$
            %%\STATE \pc{Take actions $a_{i,h}^{k} \sim \nu^*_i$, $i\in[n]$}
            %%for $p\in[P]$ 
            %%\# GREEDY POLICY
            %%\STATE $\nu^*\gets$ an \emph{optimal} or \emph{saddle} Nash Equilibrium for the $n$-player game $(Q^{1,k}_{h}(x^k_{h},\cdot),\dots,Q^{n,k}_{h}(x^k_{h},\cdot))$
            %%\STATE \pc{Take actions $a_{i,h}^{k} \sim \nu^*_i$, $i\in[n]$}
            %%for $p\in[P]$ 
            %%\# GREEDY POLICY
            \STATE Take $a_h^k \sim \pi^k_h(x^k_h)$
            \STATE Observe $x_{h+1}^{k}$ 
        \ENDFOR
        %\STATE \pc{\# So far, the doubling rounds are not treated, but I have commented the part where we could include a subroutine if we would like to treat them!}
        %
        %\STATE \#DONE BY CENTRAL SERVER
        %\STATE $\bar{\Lambda}_{h}^{k+1}\gets
        %\Lambda_{h}^{k} + 
        %\sum^P_{p=1}\phi(x_h^{k,p},a_h^{k,p})\phi(x_h^{k,p},a_h^{k,p})^\top$ 
        % \IF{$\bar{\Lambda}^{k+1}_h\preceq \bar{K}\Lambda^{k}_h$}
        %    \STATE \# KHERE IS NO DOUBLING ROUND
        %    \STATE Save $(a^{k,p}_h,x^{k,p}_h)$ for every $(p,h)\in[P]\times[H]$.
        %    \ELSE 
        %    \STATE \# PERFORM DOUBLING ROUNDS SUBROUTINE 
        %\ENDIF
    \ENDFOR
%
%  \REPEAT
%  \STATE \# TRAIN 
%  \UNTIL{done}
\end{algorithmic}
\end{algorithm}

We now present the paper's main result.

\begin{theorem}[Performance of the \algbrev ~algorithm]
\label{thm:main-nashQ}
%Assume that either all stage games (line 14) have a global optimal equilibrium or that all stage games have a saddle Nash equilibrium. Then, 
There exists an absolute constant $c_\beta>0$ such that, for any fixed $\delta\in(0,1)$, if we set $\lambda=1$ and $\beta=c_\beta dH\sqrt{\iota}$, with $\iota:=\log(dKH(n+2)/\delta)$, then, with probability at least $1-\delta$, 
\begin{equation}
\label{eqn:regret-res2}
%\begin{aligned}
\textnormal{Regret}(K)
\leq 
\cO\bigg(\sqrt{K}\sqrt{d^3H^5\iota^2}\bigg).
%+ \underbrace{\cO\bigg(\sqrt{ d^4H^4\iota}P\log\left(1+\frac{KP}{d}\right)\bigg)}_{\text{Overhead term}}.
%\end{aligned}
\end{equation}
\end{theorem}
%

{\par \textbf{Sample efficiency.}} Our regret bound is sublinear in the number of episodes $K$ and, ignoring logarithmic terms, polynomial in the parameters $d$ and $H$, i.e., \algbrev ~learns in a sample-efficient manner. Our finite-sample guarantee states that $K=\tilde{\cO}\left(\frac{d^3H^5}{\epsilon^2}\right)$ episodes suffice to achieve an average regret less than or equal to $\epsilon$, i.e., for the policies across the episodes to perform on average as an $\epsilon$-Nash equilibrium. 
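
As a short check of this episode count, one can divide the bound in~\eqref{eqn:regret-res2} by $K$ and require the result to be at most $\epsilon$. Here $C>0$ denotes a hypothetical absolute constant, introduced only for this display, that absorbs the constant hidden in $\cO(\cdot)$:
\[
\frac{\textnormal{Regret}(K)}{K}
\leq
\cO\bigg(\sqrt{\frac{d^3H^5\iota^2}{K}}\bigg)
\leq \epsilon
\quad\textnormal{whenever}\quad
K \geq C\,\frac{d^3H^5\iota^2}{\epsilon^2}.
\]
Since $\iota$ depends on $K$ only logarithmically, hiding it in the $\tilde{\cO}(\cdot)$ notation gives the stated count.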

%{\par \textbf{About our obtained bound.}} 
{\par \textbf{About the number of agents.}}
%Our bound in~\eqref{eqn:regret-res2} provides us with a finite-sample bound: it specifies the number of episodes we need to achieve 
%According to our regret measure, %%Assume we desire $\textnormal{Regret}(K)\leq\epsilon$, then 
%we observe that: the larger the number of agents $n$, the larger the number of episodes or iterations $K$ we need to produce the same learning error. 
Our bound has a logarithmic dependence on the number of agents $n$. However, we remark that the feature dimension $d$ of the linear MG \emph{might} hide dependencies on $n$ depending on how the feature vector $\phi$ is constructed (more on this below, when discussing the tabular case).
%A reason why there is not a worse dependence on $n$ is from the fact that we assume in the linear MG model that the feature dimension $d$ is independent of $n$.  any effect resulting from more agents playing the game: no matter the size $n$, as long as the  from  resulting from the nu 
%
%According to our regret measure, %%Assume we desire $\textnormal{Regret}(K)\leq\epsilon$, then 
%we observe that: the larger the number of agents $n$, the larger the number of episodes or iterations $K$ we need to produce the same learning error. 
In any case, the larger the number of agents, the more samples are needed to achieve the same average regret performance.
%
Intuitively, this makes sense, since increasing the number of agents increases the number of possible decision makers and thus the complexity of the state-action space to be sampled. This is in stark contrast with other works in the single-agent RL case where multiple agents can be deployed to explore the \emph{same} state-action space of the MDP, in which case their performance measure improves with the number of agents~\citep{Cisneros-Velarde2022OnePolicyEnough}.


{\par \textbf{Comparison with (single-agent) RL.}} For the classic single-agent RL case ($n=1$), \cite{CJ-ZY-ZW-MIJ:20} obtained the bound $\tilde{\cO}(\sqrt{K}\sqrt{d^3 H^4})$ for the regret measured with respect to the optimal policy of the underlying MDP. Our bound is therefore larger by a factor of $\sqrt{H}$, nearly matching the single-agent sample efficiency. According to these upper bounds, learning a Nash equilibrium of an MG requires more samples than what would be necessary for an MDP. It is important to highlight that, whereas the single-agent case only requires taking an action that maximizes the optimistic Q-function (see~\citep[LSVI-UCB Algorithm]{CJ-ZY-ZW-MIJ:20}), \algbrev ~requires solving for a Nash equilibrium and is thus computationally more complex. 
%; (ii) in the single-agent case there is no extra assumptions such as in \algbrev ~for the existence of specific NE -- conditions present on the asymptotic analysis by~\citep{Hu2003NashQ}. 
%
%We remark that the right-hand side of our bound is the same as obtained by~\citep{CJ-ZY-ZW-MIJ:20} for the classic single-agent RL with $n=1$ -- and thus, for a different regret metric. Though this is a remarkable result, it is important to highlight: (i) though the single-agent case requires taking an action that maximizes the optimistic Q-function (see~\citep[LSVI-UCB 
%Algorithm]{CJ-ZY-ZW-MIJ:20}), \algbrev ~requires solving for specific Nash equilibrium and thus, while sharing the same sample-efficiency, it is computationally more complex; (ii) in the single-agent case there is no extra assumptions such as in \algbrev ~for the existence of specific NE -- conditions present on the asymptotic analysis by~\citep{Hu2003NashQ}. 
%% The regret metric that we would employ for the sequential counterpart becomes $\textnormal{Regret}(K) = \sum_{k=1}^{K} V_1^{\pi^*}(x^k_1) - V_1^{\pi^{k}}(x^k_1)$, where $K$ is the number of episodes and $\pi^k$ is the (greedy) policy taken by the single agent at episode $k\in[K]$. \citep{CJ-ZY-ZW-MIJ:20} proved that with probability $1-\delta$: $\textnormal{Regret}(K)\leq O(\sqrt{K}\sqrt{d^3H^4\iota^2})$, where $\iota=\log(2dKH/\delta)$ (under $\lambda=1$ and $\beta=c_\beta d H\sqrt{\iota}$, $c_\beta$ being some absolute constant).  Therefore, the base term in our learning regret~\eqref{eqn:regret-res2} indicates an almost linear speedup, because of the factor $\sqrt{KP}$ compared to the factor $\sqrt{K}$ in the performance of the sequential algorithm. In other words, in terms of the base term, there is a complexity \emph{equivalence} between performing the sequential algorithm for $KP$ episodes and performing the parallelized version with $P$ agents for $K$ episodes. 

{\par \textbf{Comparison with~\citep{Hu2003NashQ}.}} The original Nash Q-learning proposed by~\cite{Hu2003NashQ} has as its performance metric the convergence to a Nash equilibrium of the underlying discounted MG. To ensure such convergence, they assumed the existence of either global optimal or saddle Nash equilibria in every stage game (see Definition~\ref{def:global-saddle}). In contrast, since we use regret in the context of episodic MGs, we are interested in the average performance of the computed policies across iterations, with the expectation that it will approximate the performance of a Nash equilibrium. Therefore, we are not strictly interested in convergence to a \emph{single} Nash equilibrium. For this reason, our proof makes no use of the stage-game assumptions of~\cite{Hu2003NashQ}. Their work and ours, though both model-free, use completely different proof techniques. 
%
%\cite{Hu2003NashQ} based their proof on the value-iteration idea that the estimated Q-functions originate from the successive application of an operator defined by the selection of Nash equilibria in the stage games. 
%In order to prove asymptotic convergence, 
%\cite{Hu2003NashQ} finds that two sufficient conditions to ensure contraction of the operator is the selection or global optimal or saddle Nash equilibria. %%~\citep[Lemma~16]{Hu2003NashQ}. 
%In contrast, in our proof, linear function approximation requires an analysis based on covering numbers whose upper bound make use of such equilibria.

{\par \textbf{Comparison with~\citep{Liu2021SharpSelfPlay}.}} The first Nash Q-learning algorithm in~\citep{Hu2003NashQ} was designed and analyzed for tabular RL. Motivated by concerns of large or continuous state spaces, we opted for the function approximation regime. As is known in the literature, the \algbrev ~algorithm translates directly to the tabular case by choosing a feature vector $\phi$ of dimension $d=|\cS||\A|=|\cS|\prod^n_{i=1}|\A_i|$, which turns our regret bound into $\tilde{\cO}(\sqrt{H^5|\cS|^3(\prod^n_{i=1}|\A_i|)^3K})$. In the tabular case,~\cite{Liu2021SharpSelfPlay} proposed the \emph{Multi-Nash-VI} algorithm, which obtains $\tilde{\cO}(\sqrt{H^4|\cS|^2(\prod^n_{i=1}|\A_i|)K})$, a bound that is tighter in the horizon $H$ and in the sizes of both the state and action spaces of the agents. Interestingly, in the tabular case, \algbrev ~and Multi-Nash-VI
are of a different nature,
since the former is model-free and the latter is model-based. Moreover, Multi-Nash-VI requires the computation of two Q-functions based on the constructed model, one using optimism and another using pessimism, whereas \algbrev ~requires only the computation of an optimistic Q-function.
%, but requires additional assumptions on stage games due to its original proof with function approximation (see Section~\ref{sec:NashQ-analysis}). 
As is typical for general-sum MGs, both algorithms suffer from the
\emph{curse of multi-agents} in the tabular case,
since their sample bounds depend exponentially on the number of agents (through the product of the cardinalities of the agents' action spaces)~\citep{song2022whenlearningGenSum}.
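
To make the tabular comparison concrete, the following sketch divides our bound, after substituting $d=|\cS|\prod^n_{i=1}|\A_i|$ into Theorem~\ref{thm:main-nashQ}, by the Multi-Nash-VI bound, ignoring logarithmic factors:
\[
\frac{\sqrt{H^5|\cS|^3\big(\prod^n_{i=1}|\A_i|\big)^3K}}{\sqrt{H^4|\cS|^2\big(\prod^n_{i=1}|\A_i|\big)K}}
=
\sqrt{H\,|\cS|}\;\prod^n_{i=1}|\A_i| .
\]
As an illustrative assumption (not a setting analyzed in this paper), if every agent had exactly two actions, then $\prod^n_{i=1}|\A_i|=2^n$, so our bound would scale as $2^{3n/2}$ and the Multi-Nash-VI bound as $2^{n/2}$ in the number of agents $n$; both are exponential in $n$.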
%
%\pc{Much recent} effort in MARL has been devoted to avoid the \emph{curse of multi-agents}, so that the sample bounds do not exponentially depend on the number of agents -- particularly through the product of the cardinality of their action spaces -- in both zero-sum two player Markov games and general-sum ones (finding CCE)~\citep{Bai2020SelfPlay,Liu2021SharpSelfPlay,jin2022vlearning,Mao202REfficientRLGeneralSum}. 
%
%: according to Theorem~\ref{thm:main-nashQ}, in general, solving the Markov game becomes exponentially more difficult with the number of agents. This problem is well known in the tabular case and is subject of current research and the reason why recent work has focused on looking for alternative solution concepts or further restrictions in the type of games~\citep{li2022minimaxoptimal,song2022whenlearningGenSum}.
%%Designing algorithms that does not present this problem is subject of current research~\citep{zhang2021multi}.  