\section{Conclusion}
\label{sec:conclusion}
%
We have shown that Nash Q-learning is sample-efficient under linear function approximation, a setting suited to large or continuous state spaces, by applying the principle of optimism in the face of uncertainty, which is widely used in the modern RL literature. We also compared our result with the sample complexity obtained for single-agent RL with linear function approximation and for general-sum MARL in the tabular case. We hope our work opens the path to the analysis of a more diverse set of MARL algorithms.% that look for stronger solution concepts such as NE.

One future research direction is 
%A direct extension of our analysis could be 
to obtain lower bounds on sample complexity, in order to assess how close general-sum MARL algorithms such as Nash Q-learning are to minimax optimality. % under the same conditions as Nash Q-learning.
%Future directions of research also include the finite-sample analysis of other MARL algorithms targeted to different classes of stochastic games and solution concepts.
Moreover, although most modern theoretical work in RL (including this paper) focuses on sample efficiency, it is also important to propose and study algorithms that are computationally efficient, for which weaker solution concepts for MGs such as CE and CCE are relevant. Finally, another future direction would be to extend the analysis of Nash Q-learning to 
nonlinear function approximators
such as neural networks.
%other function approximation regimes such as the widely-used neural networks as .   
%
%Our work gives rise to a host of open questions that remain to be answered. The success of more intricate coordinated exploration has been demonstrated in empirical studies, while our simplistic approach is provably near-optimal. What are the theoretical justifications in support of these coordinated exploration strategies? Are there settings under which coordinated approaches provably outperform our simplistic approach, i.e., by better matching the minimax lower bound? What could be the communication, computation, and sample complexity trade-offs between coordinated exploration and exploration using only a single policy? 

%\pc{
%\paragraph{Societal impact:} Our work provides a theoretical understanding on the effect of parallel exploration in reinforcement  learning, and as such, we do not foresee any societal impact stemming from its  results.}


\begin{acknowledgements} 
We are grateful to all the reviewers and the meta-reviewer for their time and for their comments, which helped improve our paper. This work is partially supported by NSF III 2046795, CCF 1934986, NIH 1R01MH116226, NIFA award 2020-67021-32799, the Alfred P. Sloan Foundation, and Google Inc.
\end{acknowledgements}

