\documentclass[accepted]{uai2025} % for initial submission
%\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\usepackage{amstext,amsmath,amssymb,amsthm}
\usepackage{multirow,array}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{wrapfig}
\newtheorem{proposition}{Proposition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\usepackage[font=small]{subcaption}
\usepackage{color}
\hypersetup{
	colorlinks=true,%
	citecolor={blue!50!black},
	linkcolor={red!50!black},
	urlcolor={green!50!black}
}
\usepackage[font=small]{caption}
\usepackage{bm}

\allowdisplaybreaks

\title{Best Possible Q-Learning}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Jiechuan Jiang}
\author[1]{Zongqing Lu\thanks{Correspondence to Zongqing Lu <\href{mailto:zongqing.lu@pku.edu.cn}{zongqing.lu@pku.edu.cn}>.}}
% Add affiliations after the authors
\affil[1]{%
    School of Computer Science\\
    Peking University
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Fully decentralized learning, where the global information, \textit{i.e.}, the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose \textit{best possible operator}, a novel decentralized operator, and prove that the policies of cooperative agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator when there is only one optimal joint policy. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, \textit{best possible Q-learning} (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
\end{abstract}

\section{Introduction}

Cooperative multi-agent reinforcement learning (MARL) trains a group of agents to maximize the cumulative shared reward, which has great significance for real-world applications, including logistics \citep{li2019cooperative}, traffic signal control \citep{xu2021hierarchically}, power dispatch \citep{wang2021multi}, and games \citep{vinyals2019grandmaster}. Although most existing methods follow the paradigm of centralized training and decentralized execution (CTDE), in many scenarios where the information of all agents is unavailable in the training period, each agent has to learn independently without centralized information. Thus, \textit{fully decentralized learning}, where the agents can only use local experiences without the actions of other agents, is highly desirable \citep{jiangi2q}. %deserves more thorough research.

However, in fully decentralized learning, as other agents are treated as a part of the environment and are updating their policies simultaneously, the transition probabilities from the perspective of individual agents will be non-stationary. Thus, the convergence of most decentralized algorithms, \textit{e.g.}, independent Q-learning (IQL) \citep{tan1993multi}, is not theoretically guaranteed. Multi-agent alternate Q-learning (MA2QL) \citep{su2022ma2ql} guarantees the convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple equilibria \citep{zhang2021multi}. Distributed IQL \citep{lauer2000algorithm} and I2Q \citep{jiangi2q} can learn the optimal joint policy, yet are limited to deterministic environments. 
How to guarantee convergence to the optimal joint policy in \textit{stochastic environments} remains open.

To tackle this challenge, we propose \textit{\textbf{best possible operator}}, a novel decentralized operator to update the individual state-action value of each agent, and prove that the policies of agents converge to the optimal joint policy under this operator when there is only one optimal joint policy. However, it is inefficient and thus impractical to perform best possible operator, because at each update it needs to compute the expected values under all possible transition probabilities and set the state-action value to the maximal one. Therefore, we further propose \textit{simplified best possible operator}. At each update, the simplified operator only computes the expected value under one of the possible transition probabilities and monotonically updates the state-action value. We prove that the policies of agents also converge to the optimal joint policy under the simplified operator. We instantiate the simplified operator with a Q-table for tabular cases and with neural networks for complex environments. In the Q-table instantiation, non-stationarity is naturally avoided, and in the neural network instantiation, non-stationarity in the replay buffer is no longer a drawback, but a necessary condition for convergence.

The proposed algorithm, \textit{\textbf{best possible Q-learning}} (\textbf{BQL}), is fully decentralized, without using the information of other agents. We evaluate BQL on a variety of multi-agent cooperative tasks, \textit{i.e.}, stochastic games, MPE-based differential games \citep{lowe2017multi}, Multi-Agent MuJoCo \citep{de2020deep}, SMAC \citep{samvelyan19smac}, and GRF \citep{kurach2020google}, covering fully and partially observable, deterministic and stochastic, discrete and continuous environments. Empirically, BQL substantially outperforms baselines. To the best of our knowledge, BQL is the first decentralized algorithm that guarantees the convergence to the global optimum in stochastic environments. More simplifications and instantiations of \textit{best possible operator} can be further explored. We believe BQL can be a new paradigm for fully decentralized learning.



\section{Method}


\subsection{Preliminaries}

Consider an $N$-agent MDP \citep{oliehoek2016concise} $M_{\mathrm{env}}= \langle \mathcal{S},\mathcal{O}, \mathcal{A}, R, P_{\mathrm{env}}, \gamma \rangle$ with the state space $\mathcal{S}$ and the joint action space $\mathcal{A}$. Each agent $i$ chooses an individual action $a_i$, and the environment transitions to the next state $s'$ by taking the joint action $\boldsymbol{a}$ with the transition probabilities $P_{\mathrm{env}}\left(s^{\prime} | s, \boldsymbol{a}\right)$. Although we assume in the theoretical analysis that all agents obtain the state $s$, in practice each agent $i$ can make decisions using its local observation $o_i \in \mathcal{O}$ or trajectory. All agents obtain a shared reward $r = R\left(s,s'\right) \in [r_{\min},r_{\max}]$ and learn to maximize the expected discounted return $\mathbb{E} \sum_{t=0}^{\infty} \gamma^{t} r_{t}$. In the fully decentralized setting, $M_{\mathrm{env}}$ is partially observable, since each agent $i$ only observes its own action $a_i$ instead of the joint action $\boldsymbol{a}$. From the perspective of each agent $i$, there is an MDP $M_{i} = \langle \mathcal{S},\mathcal{A}_i, R, P_{i}, \gamma \rangle$ with the individual action space $\mathcal{A}_i$ and the transition probabilities
\begin{equation}
\hspace{-3pt}
\label{eq:p}
 P_{i}\left(s^{\prime} | s, a_i\right)={\sum}_{\boldsymbol{a}_{-i}}P_{\mathrm{env}}\left(s^{\prime} | s, a_i,\boldsymbol{a}_{-i}\right)  \boldsymbol{\pi}_{-i}(\boldsymbol{a}_{-i} | s)
%\setlength{\belowdisplayskip}{3pt}
\end{equation}
where $\boldsymbol{\pi}_{-i}$ denotes the joint policy of all agents except agent $i$, and similarly for $\boldsymbol{a}_{-i}$.
According to (\ref{eq:p}), the transition probabilities $P_{i}$ depend on the policies of other agents $\boldsymbol{\pi}_{-i}$. As other agents are updating their policies continuously, $P_i$ becomes \textit{non-stationary}. Under non-stationary transition probabilities, the convergence of independent Q-learning\footnote{For simplicity, we refer to the optimal value $Q^*$ as $Q$ in this paper, unless stated otherwise.}
\begin{equation}
%\setlength{\abovedisplayskip}{3pt}
\label{eq:q}
Q_i(s,a_i) = \mathbb{E}_{P_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ]
%\setlength{\belowdisplayskip}{3pt}
\end{equation}
is not guaranteed, and how to learn the optimal joint policy in fully decentralized settings is quite a challenge. In the next section, we propose \textit{best possible operator}, a novel fully decentralized operator, which guarantees the convergence to the optimal joint policy in stochastic environments.
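
To make the non-stationarity in (\ref{eq:p}) concrete, consider a hypothetical two-agent example; the numbers below are assumptions chosen purely for illustration. Suppose that from state $s$ a goal state $g$ is reached with probability $P_{\mathrm{env}}(g | s, a_1{=}0, a_2{=}0) = 0.9$ and $P_{\mathrm{env}}(g | s, a_1{=}0, a_2{=}1) = 0.2$. Then (\ref{eq:p}) gives
\begin{equation*}
P_{1}(g | s, a_1{=}0) = 0.9\,\pi_{2}(0 | s) + 0.2\,\pi_{2}(1 | s),
\end{equation*}
which equals $0.9$ when $\pi_2(0|s)=1$, $0.55$ when $\pi_2(0|s)=0.5$, and $0.2$ when $\pi_2(0|s)=0$. Although $P_{\mathrm{env}}$ is fixed, agent $1$ faces a different $P_1$ whenever agent $2$ changes its policy.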

\subsection{Best Possible Operator}

First, let us consider the optimal joint Q-value
\begin{equation}
\label{eq:joint_q}
Q(s, \bm{a}) = \mathbb{E}_{P_{\mathrm{env}}(s'|s,\bm{a})}\left [r + \gamma \underset{\bm{a}'}{\max} Q(s', \bm{a}') \right ],
\end{equation}
which is the expected return of the optimal joint policy $\boldsymbol{\pi}^*(s) = \arg \max_{\boldsymbol{a}} Q(s,\boldsymbol{a}) $. Based on the optimal joint Q-value, for each agent $i$, we define $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$, which follows the fixed point equation:
\begin{align}
&\notag \underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) \\
=&\underset{\boldsymbol{a}_{-i}}{\max} \ \mathbb{E}_{P_{\mathrm{env}}\left(s^{\prime} | s, \boldsymbol{a}\right)  }\left[r + \gamma \underset{{a}'_{i}}{\max} \ \underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'_i,\boldsymbol{a}'_{-i}) \right ] \label{eq:them_eq2} \\
=& \mathbb{E}_{P_{\mathrm{env}}\left(s^{\prime} | s, a_i, \boldsymbol{\pi}^*_{-i}\right)}\left[r + \gamma \underset{{a}'_{i}}{\max} \ \underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'_i,\boldsymbol{a}'_{-i}) \right ] \label{eq:them_eq3}
\end{align}
where $\boldsymbol{\pi}_{-i}^*(s,a_i) = {\arg \max}_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ is the optimal conditional joint policy of other agents given $a_i$. (\ref{eq:them_eq2}) is from taking $\max_{\bm{a}_{-i}}$ on both sides of (\ref{eq:joint_q}), and (\ref{eq:them_eq3}) is by folding  $\boldsymbol{\pi}_{-i}^*(s,a_i)$ into $P_{\mathrm{env}}$. Then we have the following lemma.
\begin{lemma}
\label{lemma:1}
If each agent $i$ learns the independent value function $Q_i(s,a_i) = \max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$, and takes actions as $\arg \max_{a_i} Q_i(s,a_i)$, the agents will obtain the optimal joint policy when there is only one optimal joint policy\footnote{We can use the simple solution proposed in I2Q to relax the assumption of a unique optimal joint policy, which is included in Appendix~\ref{appendix:coordination}.}.
\end{lemma}
\begin{proof}
As $\max_{{a}_{i}} \max_{\boldsymbol{a}_{-i}}Q(s,{a}_i,\boldsymbol{a}_{-i}) = \max_{\bm{a}}Q(s, \bm{a})$ and there is only one optimal joint policy, $\arg \max_{a_i} Q_i(s,a_i)$ is the action of agent $i$ in the optimal joint action $\bm{a}$.
\end{proof}
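
The uniqueness assumption in Lemma~\ref{lemma:1} matters, as the following hypothetical example suggests (the values are assumptions for illustration only). Consider a single state $s$ and two agents with actions $\{0,1\}$, where the optimal joint Q-values are
\begin{equation*}
Q(s,0,0) = Q(s,1,1) = 1, \qquad Q(s,0,1) = Q(s,1,0) = 0.
\end{equation*}
Then $Q_i(s,0) = Q_i(s,1) = 1$ for both agents, so each $\arg\max_{a_i} Q_i(s,a_i)$ is tied, and independent tie-breaking may select the joint action $(0,1)$ with value $0$. If instead $Q(s,1,1) = 0.5$, then $Q_i(s,0) = 1 > Q_i(s,1) = 0.5$ for both agents, and the independent greedy actions recover the unique optimal joint action $(0,0)$.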
According to Lemma \ref{lemma:1}, obtaining the optimal joint policy reduces to letting each agent $i$ learn the value function $Q_i(s,a_i) = \max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$. To this end, we propose \textit{a new operator} to update $Q_i$ in a fully decentralized way:
\begin{equation}
\label{eq:bql}
\hspace*{-0.22cm}
Q_i(s,a_i) = \underset{P_i(\cdot|s,a_i)}{\max}\mathbb{E}_{P_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ]
\end{equation}Given $s$ and $a_i$, there are numerous possible $P_i(s'|s,a_i)$, induced by the different policies $\boldsymbol{\pi}_{-i}$ of the other agents. To reduce the complexity, we only consider deterministic policies, because when there is only one optimal joint policy, the optimal joint policy must be deterministic \citep{puterman1994markov}. So the operator (\ref{eq:bql}) takes the maximum only over the transition probabilities $P_i(s'|s,a_i)$ under \textit{deterministic} $\boldsymbol{\pi}_{-i}$.
%We only consider the deterministic policies, so the operator (\ref{eq:bql}) takes the maximum only over the transition probabilities $P_i(s'|s,a_i)$ under \textit{deterministic} $\boldsymbol{\pi}_{-i}$, which greatly reduces the complexity. 
Intuitively, the operator continuously pursues the `best possible expected return', until $Q_i$ reaches the optimal expected return $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$, so we name the operator (\ref{eq:bql}) \textbf{\textit{best possible operator}}. In the following, we theoretically prove that $Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ under best possible operator, thus the agents learn the optimal joint policy. Let $Q_i^k(s,a_i)$ denote the value function in the update $k$ and $Q_i(s,a_i) := Q_i^{\infty}(s,a_i)$. Then, we have the following lemma.
\begin{lemma}
\label{lemma:2}
If $Q_i^0$ is initialized to be the minimal return $\frac{r_{\min}}{1-\gamma}$, $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) \geq Q^k_i(s,a_{i}), \forall s, a_i, \forall k, $ under best possible operator. 
\end{lemma}
\begin{proof}
	We prove the lemma by induction. First, as $Q_i^0$ is initialized to be the minimal return, $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) \geq Q^0_i(s,a_{i})$. Then, suppose $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) \geq Q^{k-1}_i(s,a_{i}),\, \forall s, a_i$. By denoting $\arg \max_{P_i(s'|s,a_i)}\mathbb{E}_{P_i(s'|s,a_i)}\left [r + \gamma \max_{a'_{i}}Q^{k-1}_i(s',a'_{i})\right]$ as $P^*_i(s'|s,a_i)$, we have
	\begin{align*}
	&\underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^k_i(s,a_i)\\
	=  & \underset{\boldsymbol{a}_{-i}}{\max}\sum_{s'}P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{a}_{-i}\right) \left[r + \gamma \max_{{a}'_i,\boldsymbol{a}'_{-i}}Q(s',{a}'_i,\boldsymbol{a}'_{-i})\right] \\ &-\sum_{s'}P^*_i(s'|s,a_i)\left[r + \gamma \underset{a'_{i}}{\max}Q^{k-1}_i(s',a'_{i})\right]\\
	\geq &\sum_{s'}P^*_i(s'|s,a_i) \left[r + \gamma \max_{{a}'_i}\underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'_i,\boldsymbol{a}'_{-i})\right]  \\
    &- \sum_{s'}P^*_i(s'|s,a_i)\left[r + \gamma \underset{a'_{i}}{\max}Q^{k-1}_i(s',a'_{i})\right]\\
	= & \gamma \sum_{s'}P^*_i(s'|s,a_i)\left(\max_{{a}'_i,\boldsymbol{a}'_{-i}}Q(s',{a}'_i,\boldsymbol{a}'_{-i}) - \max_{{a}'_i}Q^{k-1}_i(s',a'_{i})\right)\\
	\geq & \gamma \sum_{s'}P^*_i(s'|s,a_i)\left(\underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'^{*}_i,\boldsymbol{a}'_{-i}) -Q^{k-1}_i(s',{a}'^{*}_i)\right) \\
    \geq & 0, 
	\end{align*}where ${a}'^{*}_i = \arg \max_{a'_i} Q^{k-1}_i(s',{a}'_i)$. Thus, it holds in the update $k$. By the principle of induction, the lemma holds for all updates.
\end{proof}
Intuitively, $ \max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) $ is the optimal expected return after taking action $a_i$, so it is the upper bound of $Q_i(s,a_i)$. Further, based on Lemma \ref{lemma:2}, we have the following lemma.
\begin{lemma}
	\label{theorem:1}
	$Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ under best possible operator.
\end{lemma}
\begin{proof} 
	For clear presentation, we use $P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{\pi}^*_{-i}\right)$ to denote $P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{\pi}^*_{-i}(s,a_i)\right)$. From (\ref{eq:them_eq3}) and (\ref{eq:bql}), we have
	%the definition of $Q_i(s,a_i)$ (\ref{eq:bql}) and $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ (\ref{eq:them_eq3}), we have
	\begin{align*}
	&\Big\|\underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^k_i(s,a_i) \Big\|_{\infty} \\
    =& \underset{s, a_i }{ \max } \left( \sum_{s'}P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{\pi}^*_{-i}\right) \left[r + \gamma \max_{{a}'_i}\underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'_i,\boldsymbol{a}'_{-i})\right] \right.\\
	&   \left.  \qquad - \sum_{s'}P^*_i(s'|s,a_i)\left[r + \gamma \underset{a'_{i}}{\max}Q^{k-1}_i(s',a'_{i})\right] \right) \\
	\leq & \underset{s, a_i }{ \max } \left( \sum_{s'}P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{\pi}^*_{-i}\right) \left[r + \gamma \max_{{a}'_i}\underset{\boldsymbol{a}'_{-i}}{\max}Q(s',{a}'_i,\boldsymbol{a}'_{-i})\right] \right.\\
	& \quad  \left. - \sum_{s'}P_{\mathrm{env}}\left(s'  | s, a_i, \boldsymbol{\pi}^*_{-i}\right)\left[r + \gamma \underset{a'_{i}}{\max}Q^{k-1}_i(s',a'_{i})\right] \right)\\
	\leq & \gamma \underset{s', a'_i }{ \max } \left(\underset{\boldsymbol{a}'_{-i}}{\max}Q(s',a'_i,\boldsymbol{a}'_{-i}) - Q^{k-1}_i(s',a'_i)\right)\\
	 = & \gamma \Big\|\underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^{k-1}_i(s,a_i) \Big\|_{\infty}.
	\end{align*}The first equality holds because the difference is non-negative by Lemma~\ref{lemma:2}, and the first inequality holds because $P^*_i$ maximizes the expected value. Applying this inequality recursively, we have
	\begin{align*}
    &\left\|\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^k_i(s,a_i) \right\|_{\infty} \leq \\
    &\qquad \qquad \qquad \gamma^k\left\|\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^0_i(s,a_i) \right\|_{\infty} .
    \end{align*}
    Letting $k \rightarrow \infty$, we obtain $Q_i(s,a_i) \rightarrow \max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) $, thus the lemma holds.
\end{proof}
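
As a rough numerical illustration of this geometric rate (the values are assumptions for exposition, not experimental settings): since $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) \leq \frac{r_{\max}}{1-\gamma}$ and $Q^0_i = \frac{r_{\min}}{1-\gamma}$, the initial gap is at most $\frac{r_{\max}-r_{\min}}{1-\gamma}$. With $r_{\max}-r_{\min} = 1$ and $\gamma = 0.9$, this bound is $10$, and
\begin{equation*}
\gamma^{k} \cdot 10 \leq 0.01 \iff k \geq \frac{\ln 10^{-3}}{\ln 0.9} \approx 65.6,
\end{equation*}
so $66$ applications of best possible operator suffice to bring the gap below $0.01$ in this hypothetical setting.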
%Theorem \ref{theorem:1} tells that if each agent $i$ independently performs \textit{best possible operator} to update $Q_i(s,a_i)$, $Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$. Then, according to Lemma~\ref{lemma:1}, the agents learn the optimal joint policy. 

According to Lemmas~\ref{lemma:1} and~\ref{theorem:1}, we immediately have:
\begin{theorem}
	\label{theorem:3}
	The agents learn the optimal joint policy under best possible operator when there is only one optimal joint policy.
\end{theorem}
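
As a hypothetical illustration of (\ref{eq:bql}) (the game and its numbers are assumptions chosen for exposition), consider one decision state $s$, two agents with actions $\{0,1\}$, and two absorbing next states $g$ and $b$ with $R(s,g)=1$, $R(s,b)=0$, and zero reward thereafter. Let
\begin{equation*}
\begin{aligned}
P_{\mathrm{env}}(g|s,0,0) &= 0.9, & P_{\mathrm{env}}(g|s,0,1) &= 0.2,\\
P_{\mathrm{env}}(g|s,1,0) &= 0.4, & P_{\mathrm{env}}(g|s,1,1) &= 0.6.
\end{aligned}
\end{equation*}
Each deterministic policy of the other agent induces one $P_i(\cdot|s,a_i)$, and taking the maximum of the expected values over them gives
\begin{equation*}
\begin{aligned}
Q_1(s,0) &= \max\{0.9, 0.2\} = 0.9, & Q_1(s,1) &= \max\{0.4, 0.6\} = 0.6,\\
Q_2(s,0) &= \max\{0.9, 0.4\} = 0.9, & Q_2(s,1) &= \max\{0.2, 0.6\} = 0.6.
\end{aligned}
\end{equation*}
Both agents greedily choose action $0$, which is the unique optimal joint action with expected return $0.9$. By contrast, an update that took the maximum over sampled returns rather than expected returns would assign the value $1$ to every action here, since $g$ is reachable under every joint action; this suggests why the maximum in (\ref{eq:bql}) is taken over transition probabilities.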


\subsection{Simplified Best Possible Operator}

Best possible operator guarantees the convergence to the optimal joint policy. However, to perform (\ref{eq:bql}), at every update each agent $i$ has to compute the expected values under all possible transition probabilities and update $Q_i$ to be the maximal expected value, which is too costly. Therefore, we introduce an auxiliary value function $Q^{\mathrm{e}}_i(s,a_i)$, and simplify (\ref{eq:bql}) into two operators. First, at each update, we randomly select one of the possible transition probabilities $\tilde{P}_i$ for each $(s,a_i)$ and update $Q^{\mathrm{e}}_i(s,a_i)$ by
\begin{equation}
\label{eq:bql-1}
Q^{\mathrm{e}}_i(s,a_i) = \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ].
\end{equation}
$Q^{\mathrm{e}}_i(s,a_i)$ represents the expected value of the selected transition probabilities. Then we monotonically update $Q_i(s,a_i)$ by
\begin{equation}
\label{eq:bql-2}
Q_i(s,a_i) = \max\left(Q_i(s,a_i),Q^{\mathrm{e}}_i(s,a_i)\right).
\end{equation}
We define (\ref{eq:bql-1}) and (\ref{eq:bql-2}) together as \textit{\textbf{simplified best possible operator}}. By performing simplified best possible operator, $Q_i(s,a_i)$ is efficiently updated towards the maximal expected value. We have the following lemma.
%After simplifying operator~\ref{eq:bql} into two operators~\ref{eq:bql-1} and~\ref{eq:bql-2}, instead of taking the max expectation over all possible transition probabilities, each update we only compute expectation on random one of possible transition probabilities by (\ref{eq:bql-1}) and monotonically update $Q_i(s,a_i)$ by (\ref{eq:bql-2}). By alternately performing (\ref{eq:bql-1}) and (\ref{eq:bql-2}), $Q_i(s,a_i)$ softly and efficiently captures the max expectation. And we provide the theorem:
\begin{lemma}
	\label{theorem:2}
	$Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ under simplified best possible operator.
\end{lemma}
\begin{proof}
	According to~(\ref{eq:bql-2}), as $Q_i(s,a_i)$ is monotonically increased, $Q^k_i(s,a_i) \geq Q^{k-1}_i(s,a_i)$ in the update $k$. Similar to the proof of Lemma~\ref{lemma:2}, we can easily prove $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) \geq Q^k_i(s,a_{i})$ under (\ref{eq:bql-1}) and (\ref{eq:bql-2}). Thus, $\{Q^k_i(s,a_i)\}$ is an increasing sequence and bounded above. According to the monotone convergence theorem,  $\{Q^k_i(s,a_i)\}$ converges when $k \rightarrow \infty$, and let $Q_i(s,a_i) := Q^\infty_i(s,a_i)$. 
	
	Then we prove that the converged value $Q_i(s,a_i)$ is equal to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$. Due to monotonicity and convergence, $$\forall \epsilon > 0,\ \forall s, a_i,\ \exists K \text{ such that when } k>K,\, Q^k_i(s,a_i) - Q^{k-1}_i(s,a_i) \leq \epsilon,$$no matter which $\tilde{P}_i$ is selected in the update $k$. Since each $\tilde{P}_i$ can be selected, when selecting $\tilde{P}_i(s'|s,a_i) =  \arg \max_{P_i(s'|s,a_i)}\mathbb{E}_{P_i(s'|s,a_i)}\left [r + \gamma \max_{a'_{i}}Q^{k-1}_i(s',a'_{i})\right] = P^*_i(s'|s,a_i),$
	by performing (\ref{eq:bql-1}) and (\ref{eq:bql-2}), we have
	\begin{align*}
	&Q^{k-1}_i(s,a_i) + \epsilon \geq Q^k_i(s,a_i) \geq Q^{\mathrm{e}}_i(s,a_i) \\
    &= \sum_{s'}P^*_i(s'|s,a_i)\left[r(s,s') + \gamma \underset{a'_{i}}{\max}Q^{k-1}_i(s',a'_{i})\right] .
	\end{align*}
	As in the proof of Lemma~\ref{theorem:1}, we have
	\begin{align*}
	&\underset{s, a_i }{ \max }\left(\underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^{\mathrm{e}}_i(s,a_i) \right) \\
    &\leq \gamma\underset{s, a_i }{ \max }\left(\underset{\boldsymbol{a}_{-i}}{\max}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^{k-1}_i(s,a_i) \right).
	\end{align*}
	Use $s^*, a^*_i$ to denote $$\arg \max_{s, a_i } \left(\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^{k-1}_i(s,a_i) \right).$$ Since $Q^{k-1}_i(s,a_i) + \epsilon \geq Q^{\mathrm{e}}_i(s,a_i),$
	\begin{align*}
	&\underset{\boldsymbol{a}_{-i}}{\max}Q(s^*,a^*_i,\boldsymbol{a}_{-i}) - Q^{k-1}_i(s^*,a^*_i) - \epsilon \\
    &\leq \gamma \underset{\boldsymbol{a}_{-i}}{\max}Q(s^*,a^*_i,\boldsymbol{a}_{-i}) - \gamma Q^{k-1}_i(s^*,a^*_i) .
	\end{align*}
	Then, we have
	$$\Big\|\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i}) - Q^{k-1}_i(s,a_i) \Big\|_{\infty} \leq \frac{\epsilon}{1 - \gamma}.$$
	Thus, $Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$.
\end{proof}
%In the next section, we will show that this simplification is meaningful and practical in complex environments.
According to Lemmas~\ref{lemma:1} and~\ref{theorem:2}, we also have:
\begin{theorem}
	\label{theorem:4}
	The agents learn the optimal joint policy under simplified best possible operator when there is only one optimal joint policy.
\end{theorem}
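
A short hypothetical trace may help to see how (\ref{eq:bql-1}) and (\ref{eq:bql-2}) interact (the numbers are assumptions for illustration). Fix a pair $(s,a_i)$ and suppose there are two possible transition probabilities, $\tilde{P}^{a}_i$ and $\tilde{P}^{b}_i$, whose expected targets in (\ref{eq:bql-1}) are $0.2$ and $0.9$, respectively, and do not depend on $Q_i$ (e.g., because all next states are absorbing with zero subsequent reward). Starting from $Q^0_i(s,a_i) = 0$:
\begin{align*}
&\text{select } \tilde{P}^{a}_i: & Q^{\mathrm{e}}_i &= 0.2, & Q^1_i &= \max(0, 0.2) = 0.2,\\
&\text{select } \tilde{P}^{b}_i: & Q^{\mathrm{e}}_i &= 0.9, & Q^2_i &= \max(0.2, 0.9) = 0.9,\\
&\text{select } \tilde{P}^{a}_i: & Q^{\mathrm{e}}_i &= 0.2, & Q^3_i &= \max(0.9, 0.2) = 0.9.
\end{align*}
Once the maximizing transition probabilities have been selected, $Q_i(s,a_i)$ stays at $0.9$, which is the value that (\ref{eq:bql}) would produce in a single update for this example.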
%In the next section, we show that this simplified operator is more practical in complex environments.

\subsection{Best Possible Q-Learning}
\label{sec:bql}

\textbf{\textit{Best possible Q-learning}} (BQL) is instantiated on simplified best possible operator. We first consider learning Q-table for tabular cases. %To implement simplified best possible operator, %the simplified operator~\ref{eq:bql-1} and~\ref{eq:bql-2}, 
The key challenge is how to obtain all possible transition probabilities under deterministic $\boldsymbol{\pi}_{-i}$ during learning. To solve this issue, the whole training process is divided into $M$ epochs. At epoch $m$, each agent $i$ randomly and independently initializes a deterministic policy $\hat{\pi}_i^m$ and selects a subset of states $S_i^m$. Then each agent $i$ interacts with the environment using the deterministic policy
$$\left\{\begin{matrix}
\arg \max_{a_i} Q_i(s,a_i) & \text{if } s \notin S_i^m,\\ 
\hat{\pi}_i^m(s) & \text{else}.
\end{matrix}\right.$$
Each agent $i$ stores independent experiences $(s,a_i,s',r)$ in the replay buffer $\mathcal{D}_i^m$. As $P_i$ depends on $\boldsymbol{\pi}_{-i}$ and agents act according to deterministic policies, $\mathcal{D}_i^m$ contains one $P_i$ under a deterministic $\boldsymbol{\pi}_{-i}$. Since $P_i$ will change if other agents modify their policies $\boldsymbol{\pi}_{-i}$, acting according to the randomly initialized policy $\hat{\pi}_i^m$ on $S_i^m$ in epoch $m$ not only helps each agent $i$ to explore state-action pairs, but also helps other agents to explore possible transition probabilities. When $M$ is sufficiently large, given any $(s,a_i)$ pair, any possible $P_i(\cdot|s,a_i)$ can be found in some replay buffer.

After the interaction of epoch $m$, each agent $i$ has a buffer series $\{\mathcal{D}_i^1,\cdots,\mathcal{D}_i^m\}$, each of which has different transition probabilities. In the training period of epoch $m$, each agent $i$ randomly selects one replay buffer $\mathcal{D}_i^j$ from $\{\mathcal{D}_i^1,\cdots,\mathcal{D}_i^m\}$ and samples mini-batches of experiences $(s,a_i,s',r)$ from $\mathcal{D}_i^j$ to update the Q-table $Q^{\mathrm{e}}_i(s,a_i)$ by (\ref{eq:bql-1}), and then samples mini-batches from $\mathcal{D}_i^j$ to update $Q_i(s,a_i)$ by (\ref{eq:bql-2}). The Q-table implementation is summarized in Algorithm~\ref{alg:1}.


The sample efficiency of collecting the buffer series seems to be a limitation of BQL, so we analyze it further. Simplified best possible operator requires that any possible $P_i(\cdot|s,a_i)$ of each $(s,a_i)$ pair can be found in some buffer, but does not care about the relationship between the transition probabilities of different state-action pairs in the same buffer. So BQL ideally needs only $|\mathcal{A}_i| \times |\mathcal{A}_{-i}| = |\mathcal{A}|$ small buffers to cover all possible $P_i$ for any $(s,a_i)$ pair, which is very efficient for experience collection. We give an intuitive illustration of this and argue that BQL has similar sample complexity to joint Q-learning~(\ref{eq:joint_q}) in Appendix~\ref{appendix:buffer}.
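
For a sense of scale, consider hypothetical problem sizes (assumptions for illustration, not those of the experiments): $N=3$ agents with $4$ actions each and $|\mathcal{S}| = 10$ states. Then $|\mathcal{A}_i| = 4$ and $|\mathcal{A}_{-i}| = 16$, so the ideal number of buffers is
\begin{equation*}
|\mathcal{A}_i| \times |\mathcal{A}_{-i}| = 4 \times 16 = 64,
\end{equation*}
whereas the number of distinct deterministic policies $\boldsymbol{\pi}_{-i}$ of the other agents is $|\mathcal{A}_{-i}|^{|\mathcal{S}|} = 16^{10} \approx 1.1 \times 10^{12}$. The gap arises because the operator treats each $(s,a_i)$ pair separately and thus does not need one buffer per policy $\boldsymbol{\pi}_{-i}$.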


In complex environments with large or continuous state-action spaces, it is inefficient and costly to follow the experience collection in tabular cases, where the agents cannot update their policies during the interaction of each epoch and each epoch requires adequate samples to accurately estimate the expectation (\ref{eq:bql-1}). Thus, in complex environments, as in IQL, each agent $i$ only maintains one replay buffer $\mathcal{D}_i$, which contains all historical experiences, and uses the same $\epsilon$-greedy policy as IQL (without the randomly initialized deterministic policy $\hat{\pi}_i$). Then we instantiate simplified best possible operator with neural networks $Q_i$ and $Q^{\mathrm{e}}_i$. $Q^{\mathrm{e}}_i$ is updated by minimizing:
\begin{align}
\label{eq:1}
\mathbb{E}_{s,a_i,s',r \sim \mathcal{D}_i}  &\left[ \left(Q^{\mathrm{e}}_i\left(s, a_i\right)-r-\gamma Q_i(s^{\prime}, a_i^{\prime*})\right)^2\right], \\
\notag a_i^{\prime*} &= \arg \max_{{a}'_i} {Q}_i(s',{a}'_i).
\end{align}
And $Q_i$ is updated by minimizing:
\begin{align}
\label{eq:2}
\mathbb{E}_{s,a_i \sim \mathcal{D}_i} \left[ w(s, a_i)\left(Q_i\left(s, a_i\right)- \bar{Q}^{\mathrm{e}}_i(s, a_i) \right)^2\right], \\ 
\notag w(s, a_i) = \left\{\begin{matrix}
1 & \text{if }  \bar{Q}^{\mathrm{e}}_i(s, a_i) > Q_i\left(s, a_i\right)\\ 
\lambda & \text{else}.
\end{matrix}\right.
%\setlength{\belowdisplayskip}{3pt}
\end{align}
$\bar{Q}^{\mathrm{e}}_i$ is the softly updated target network of $Q^{\mathrm{e}}_i$. When $\lambda = 0$, (\ref{eq:2}) is equivalent to (\ref{eq:bql-2}). However, when $\lambda = 0$, the positive random noise of $Q_i$ in the update can be continuously accumulated, which may cause value overestimation. So we adopt the weighted max in (\ref{eq:2}) by setting $0<\lambda<1$ to offset the positive random noise. In continuous action space, following DDPG \citep{lillicrap2016continuous}, we train a policy network $\pi_i(s)$ by maximizing $Q_i(s,\pi_i(s))$ as a substitute for $\arg \max_{a_i} {Q}_i(s,a_i)$. The neural network implementation is summarized in Algorithm~\ref{alg:2}.
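
To illustrate the effect of the weight $w$ in (\ref{eq:2}), consider a hypothetical scalar case in which $Q_i(s,a_i)$ is treated as a single free parameter updated by one gradient step with step size $\alpha$ (the symbol $\alpha$ and all values below are assumptions for exposition):
\begin{equation*}
Q_i(s,a_i) \leftarrow Q_i(s,a_i) - 2\alpha\, w(s,a_i)\left(Q_i(s,a_i) - \bar{Q}^{\mathrm{e}}_i(s,a_i)\right).
\end{equation*}
Take $\alpha = 0.5$, $\lambda = 0.1$, and $Q_i(s,a_i) = 0.6$. If $\bar{Q}^{\mathrm{e}}_i(s,a_i) = 1.0$, then $w = 1$ and $Q_i(s,a_i)$ moves up to $0.6 + 0.4 = 1.0$. If $\bar{Q}^{\mathrm{e}}_i(s,a_i) = 0.2$, then $w = \lambda$ and $Q_i(s,a_i)$ only moves down to $0.6 - 0.04 = 0.56$. Increases are thus followed at full strength, while decreases are damped by $\lambda$ rather than ignored.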

Simplified best possible operator is meaningful for neural network implementation. As there is only one buffer $\mathcal{D}_i$, we cannot perform (\ref{eq:bql}) but can still perform (\ref{eq:bql-1}) and (\ref{eq:bql-2}) on $\mathcal{D}_i$. As other agents are updating their policies, the transition probabilities in $\mathcal{D}_i$ will continuously change. If $\mathcal{D}_i$ sufficiently goes through all possible transition probabilities, $Q_i(s,a_i)$ converges to $\max_{\boldsymbol{a}_{-i}}Q(s,a_i,\boldsymbol{a}_{-i})$ and the agents learn the optimal joint policy. That is to say, \textit{non-stationarity in the replay buffer is no longer a drawback, but a necessary condition for BQL}.



\section{Related Work}

Most existing MARL methods \citep{lowe2017multi,iqbal2019actor,wang2020dop,zhang2021fop, su2021divergence, peng2021facmac,li2022difference,sunehag2018value,rashid2018qmix,son2019qtran} follow centralized training and decentralized execution (CTDE), where the information of all agents can be accessed in a centralized way during training. Unlike these methods, we focus on fully decentralized learning where global information is not available. The most straightforward decentralized methods, \textit{i.e.}, independent Q-learning \citep{tan1993multi} and independent PPO (IPPO) \citep{de2020independent}, cannot guarantee the convergence of the learned policy, because the transition probabilities are non-stationary from the perspective of each agent as all agents are learning policies simultaneously. Multi-agent alternate Q-learning (MA2QL) \citep{su2022ma2ql} guarantees the convergence to a Nash equilibrium, but the converged equilibrium may not be the optimal one when there are multiple Nash equilibria. Moreover, to obtain the theoretical guarantee, it has to be trained in an on-policy manner and cannot use replay buffers, which leads to poor sample efficiency. Following the principle of optimistic estimation, Hysteretic IQL \citep{matignon2007hysteretic} applies a slower learning rate to updates that decrease the value estimate. Distributed IQL \citep{lauer2000algorithm}, a special case of Hysteretic IQL with the slow learning rate being zero, guarantees the convergence to the optimum but only in deterministic environments.  
I2Q \citep{jiangi2q} lets each agent perform independent Q-learning on ideal transition probabilities and can learn the optimal joint policy, but only in deterministic environments. 
Our BQL is the first fully decentralized algorithm that converges to the optimal joint policy in stochastic environments.

\label{appendix:algorithm}
\begin{algorithm}[t]
	\caption{BQL with Q-table for each agent $i$}
	\label{alg:1}
	\begin{algorithmic}[1]	
		\STATE Initialize tables $Q_i$ and $Q^{\mathrm{e}}_i$.			
		\FOR{$m = 1, \ldots, M$}
		\STATE Initialize the replay buffer $\mathcal{D}^m_i$ and the exploration policy $\hat{\pi}_i^m$.
		\STATE  All agents interact with the environment and store experiences $(s,a_i,s',r)$ in $\mathcal{D}^m_i$.
		\FOR{$t = 1, \ldots, n\_update$}
		\STATE Randomly select a buffer $\mathcal{D}^j_i$ from $\mathcal{D}_i^1,\ldots,\mathcal{D}_i^m$.
		\STATE Update $Q^{\mathrm{e}}_i$ according to (\ref{eq:bql-1}) by sampling from $\mathcal{D}^j_i$.
		\STATE Update $Q_i$ according to (\ref{eq:bql-2}) by sampling from $\mathcal{D}^j_i$.
		\ENDFOR		
		\ENDFOR		
	\end{algorithmic}
\end{algorithm}

\begin{algorithm}[t]
	\caption{BQL with neural network for each agent $i$}
	\label{alg:2}
	\begin{algorithmic}[1]	
		\STATE Initialize neural networks $Q_i$ and $Q^{\mathrm{e}}_i$, and the target network $\bar{Q}^{\mathrm{e}}_i$.				
		\STATE Initialize the replay buffer $\mathcal{D}_i$.
		\FOR{$t = 1, \ldots, n\_iteration$}
		\STATE  All agents interact with the environment and store experiences $(s,a_i,s',r)$ in $\mathcal{D}_i$.
		\STATE Sample a mini-batch from $\mathcal{D}_i$.
		\STATE Update $Q^{\mathrm{e}}_i$ by minimizing (\ref{eq:1}).
		\STATE Update $Q_i$ by minimizing (\ref{eq:2}).
		\STATE Update the target networks $\bar{Q}^{\mathrm{e}}_i$.
		\ENDFOR		
	\end{algorithmic}
\end{algorithm}

In the next section, we compare BQL against these Q-learning variants (Distributed IQL is included in Hysteretic IQL). 
A comparison with on-policy algorithms such as IPPO, which are not sample-efficient, especially in fully decentralized settings, is outside our focus and is deferred to the appendix.
%As it is unfair to compare off-policy algorithms with on-policy algorithms, \textit{e.g.}, IPPO, due to sample efficiency \citep{SpinningUp2018}, this is out of focus and deferred to Appendix.
Decentralized methods with communication \citep{zhang2018fully,konan2021iterated,li2020multi} allow agents to share information with neighboring agents over a communication channel. However, they do not follow the fully decentralized setting and thus are beyond the scope of this paper.

\section{Experiments}


We first test BQL with Q-table on randomly generated cooperative stochastic games to verify its convergence and optimality. Then, to illustrate its performance on complex tasks, we compare BQL with neural networks against Q-learning variants on MPE-version differential games \citep{jiangi2q}, Multi-Agent MuJoCo \citep{peng2021facmac}, SMAC \citep{samvelyan19smac}, and GRF \citep{kurach2020google}. The experiments cover both fully and partially observable, deterministic and stochastic, discrete and continuous environments. Since we consider the fully decentralized setting, BQL and the baselines \textit{do not use parameter sharing}. The results are presented using mean and standard deviation. More details about hyperparameters are available in Appendix~\ref{appendix:hyperparameters}.


\subsection{Stochastic Games}


To support the theoretical analysis of BQL, we test the Q-table instantiation on stochastic games with $4$ agents, $30$ states, and infinite horizon. Each agent has $4$ actions, so the joint action space has size $|\mathcal{A}| = 4^4 = 256$. The distribution of initial states is uniform. Given a joint action, each state can transition to any state according to the transition probabilities. The transition probabilities and reward function are randomly generated and fixed in each game. We randomly generate $20$ games and train the agents with four different seeds in each game.
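
For a sense of scale, the sizes involved can be written out explicitly. This is an illustrative count derived from the setting above, not an additional experimental detail:
\begin{align*}
|\mathcal{A}| &= 4^4 = 256, & |\mathcal{A}_{-i}| &= 4^3 = 64, \\
|\mathcal{S}|\,|\mathcal{A}| &= 30 \times 256 = 7680, & |\mathcal{S}|\,|\mathcal{A}_i| &= 30 \times 4 = 120.
\end{align*}
A joint Q-table over $(s,\boldsymbol{a})$ would thus have $7680$ entries, whereas each agent's table over $(s,a_i)$ has $120$ entries, and for the same $(s,a_i)$ the other agents can select any of $64$ joint actions, each of which may induce a different transition distribution.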

The mean normalized return and standard deviation over the $20$ games are shown in Figure~\ref{fig:game}. IQL cannot learn the optimal policies due to non-stationarity. Although it uses optimistic updates to remedy non-stationarity, Hysteretic IQL (H-IQL) still cannot solve this problem in stochastic environments and shows similar performance to IQL. In Appendix~\ref{appendix:hysteretic}, we thoroughly analyze the difference and relationship between H-IQL and BQL. I2Q performs Q-learning on the ideal transition function, where the next state is deterministically the one with the highest value; this is impossible in stochastic tasks, so I2Q cannot guarantee the optimal joint policy in stochastic environments. MA2QL guarantees convergence to a Nash equilibrium, but not necessarily the optimal one, so there is a performance gap between MA2QL and the optimal policies. BQL converges to the optimum, and the tiny remaining gap is caused by the fitting error of the Q-table update. This verifies our theoretical analysis. Note that, in the Q-table instantiations, MA2QL and BQL collect experience differently from IQL, \textit{i.e.}, they use a different exploration strategy and replay buffer: MA2QL only uses on-policy experiences, and BQL collects a series of small buffers. However, \textit{for sample efficiency, the two methods have to use the same experience collection as IQL in complex tasks with neural networks.} MA2QL- and BQL- respectively denote the two methods with the same experience collection as IQL. Trained on off-policy experiences, MA2QL- suffers from non-stationarity and achieves similar performance to IQL. As analyzed in Section~\ref{sec:bql}, even with only one buffer, BQL agents can converge to the optimum if the non-stationary buffer sufficiently covers all possible transition probabilities.
Although covering all possible transition probabilities with one buffer is inefficient, BQL- significantly outperforms IQL, which implies the potential of BQL with one buffer in complex tasks.


\begin{figure*}[!t]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/p1.pdf}
		\caption{stochastic games}	
		\label{fig:game}	
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/p2.pdf}
		\caption{$|\mathcal{D}_i^m|$}	
		\label{fig:size}	
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/p3.pdf}
		\caption{$|S_i^m|$}	
		\label{fig:s}	
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/p4.pdf}
		\caption{one-stage game}
		\label{fig:matrix}		
	\end{subfigure}
	\caption{Learning curves on cooperative stochastic games (normalized by the optimal return).}
\end{figure*}

Figure~\ref{fig:size} shows the effect of the size of buffer $\mathcal{D}_i^m$ at epoch $m$. If $|\mathcal{D}_i^m|$ is too small, \textit{e.g.}, $200$, the experiences in $\mathcal{D}_i^m$ are insufficient to accurately estimate the expected value~(\ref{eq:bql-1}). If $|\mathcal{D}_i^m|$ is too large, \textit{e.g.}, $10000$, the experiences in $\mathcal{D}_i^m$ are redundant, and the buffer series has difficulty covering all possible transition probabilities given fixed total training timesteps. Figure~\ref{fig:s} shows the effect of $|S_i^m|$, the number of states on which the agents perform the randomly initialized deterministic policy $\hat{\pi}_i^m$ for exploration. A larger $|S_i^m|$ means stronger exploration of both state-action pairs and possible transition probabilities, which leads to better performance.

We then consider a one-stage game that is widely adopted in MARL \citep{son2019qtran}. There are $2$ agents, and each agent has $3$ actions. The reward matrix is
\begin{equation}
\notag
%\setlength{\abovedisplayskip}{3pt}
\begin{array}{c|ccc}
a_1/a_2 & \mathcal{A}^{(1)} & \mathcal{A}^{(2)} &\mathcal{A}^{(3)} \\ \hline
\mathcal{A}^{(1)} & \mathbf{8}  & -12 &-12 \\ 
\mathcal{A}^{(2)} & -12 & 0 & 0\\ 
\mathcal{A}^{(3)} & -12 & 0 & 0
\end{array}
%\setlength{\belowdisplayskip}{3pt}
\end{equation}
where the reward $8$ is the global optimum and the joint actions with reward $0$ are sub-optimal Nash equilibria. As shown in Figure~\ref{fig:matrix}, MA2QL converges to a sub-optimal Nash equilibrium when the initial policy of the second agent selects $\mathcal{A}^{(2)}$ or $\mathcal{A}^{(3)}$. BQL, in contrast, converges to the global optimum.
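
A short worked calculation may help to see why independent learners are drawn to the sub-optimal equilibria. It assumes, purely for illustration, that the second agent explores uniformly at random; this assumption and the notation $\bar{r}$ and $r(a_1,a_2)$ are introduced only for this example and are not part of the experimental setup. Averaging each row of the reward matrix gives the first agent's expected reward per action,
\begin{align*}
\bar{r}(\mathcal{A}^{(1)}) &= \tfrac{1}{3}(8 - 12 - 12) = -\tfrac{16}{3} \approx -5.33, \\
\bar{r}(\mathcal{A}^{(2)}) &= \bar{r}(\mathcal{A}^{(3)}) = \tfrac{1}{3}(-12 + 0 + 0) = -4,
\end{align*}
so a learner that averages over the other agent's behavior prefers $\mathcal{A}^{(2)}$ or $\mathcal{A}^{(3)}$. Taking the maximum over the other agent's action instead gives
\begin{align*}
\max_{a_2} r(\mathcal{A}^{(1)}, a_2) = 8, \qquad \max_{a_2} r(\mathcal{A}^{(2)}, a_2) = \max_{a_2} r(\mathcal{A}^{(3)}, a_2) = 0,
\end{align*}
which ranks $\mathcal{A}^{(1)}$ first and recovers the global optimum.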


\subsection{MPE}

\begin{figure*}[!t]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/0_2.pdf}
		\caption{$\beta = 0.2$}		
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/0_3.pdf}
		\caption{$\beta = 0.3$}	
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/0_4.pdf}
		\caption{$\beta = 0.4$}	
	\end{subfigure}
	\hspace{-0.25cm}
	\begin{subfigure}[t]{0.253\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/0_5.pdf}
		\caption{$\beta = 0.5$}	
	\end{subfigure}
	\caption{Learning curves on MPE-based differential games with different $\beta$.}
	\label{fig:mpe}		
\end{figure*}

To evaluate the effectiveness of BQL with neural network implementation, we adopt the $3$-agent MPE-based differential game used in I2Q \citep{jiangi2q}, where $3$ agents can move in the range $[-1,1]$. Different from the original deterministic version, we add stochasticity to it. In each timestep, agent $i$ takes the action $a_i \in [-1,1]$, and its position is updated as $x_i = \text{clip}(x_i + 0.1\times a_i,-1,1)$ (\textit{i.e.}, the updated position is clipped to $[-1,1]$) with probability $1- \beta$, or as $-x_i$ with probability $\beta$. Thus $\beta$ controls the stochasticity. The state is the vector of positions $(x_1,x_2,x_3)$. The reward function of each timestep is
\begin{align*}
\setlength{\abovedisplayskip}{2pt}
r = \begin{cases}
0.5 \cos(4l\pi)+0.5 & \text{ if } l\leq 0.25 \\ 
0 & \text{ if } 0.25< l\leq 0.6 \\ 
0.15 \cos(5\pi(l-0.8))+0.15 & \text{ if } 0.6< l\leq 1.0 \\
0 &\text{ if } l>1.0
\end{cases},
\setlength{\belowdisplayskip}{2pt}
\end{align*} 
where $l = \sqrt{\frac{2}{3}(x_1^2+x_2^2+x_3^2)}$. We visualize the relation between $r$ and $l$ in Figure~\ref{fig:density}. There is only one global optimum ($l = 0$ and $r = 1$) but infinitely many sub-optima ($l = 0.8$ and $r=0.3$), and the narrow region with $r > 0.3$ is surrounded by the region with $r = 0$. So it is quite a challenge to learn the optimal policies in a fully decentralized way. Each episode contains $100$ timesteps, and the initial positions follow the uniform distribution. We perform experiments with different stochasticities $\beta$, and train the agents with eight seeds for each $\beta$. In continuous environments, BQL and baselines are built on DDPG.
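
The shape of this reward landscape can be checked with a short calculation from the definitions above (values are rounded). Since $|x_i| \le 1$, the distance satisfies $0 \le l \le \sqrt{\tfrac{2}{3}\cdot 3} = \sqrt{2} \approx 1.41$. The third branch never exceeds $0.15 + 0.15 = 0.3$, so the reward is larger than the sub-optimal value only in the first branch, where
\[
0.5\cos(4l\pi) + 0.5 > 0.3 \iff \cos(4l\pi) > -0.4 \iff l < \frac{\arccos(-0.4)}{4\pi} \approx 0.158 .
\]
With $100$ timesteps per episode, an episode spent entirely at the global optimum yields a return of $100 \times 1 = 100$, and an episode spent entirely on the sub-optimal ring $l = 0.8$ yields $100 \times 0.3 = 30$.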

As shown in Figure~\ref{fig:mpe}, IQL always falls into the local optimum (total reward $\approx 30$) because of the non-stationary transition probabilities. H-IQL only escapes the local optimum in one seed in the setting with $\beta = 0.3$. According to the theoretical analysis in the I2Q paper, the value estimation error of I2Q becomes larger as stochasticity grows, which is why I2Q shows poor performance with $\beta = 0.4$ and $0.5$. In neural network implementations, MA2QL and BQL use the same experience collection as IQL, so there is no MA2QL- and BQL-. MA2QL converges to the local optimum because it cannot guarantee that the converged equilibrium is the global optimum, especially when trained on off-policy data. BQL ($\lambda = 0.01$) can escape from the local optimum in more than $4$ seeds in all settings, which demonstrates the effectiveness of our optimization objectives~(\ref{eq:1}) and~(\ref{eq:2}). The difference between the global optimum (total reward $\approx 100$) and the local optimum is large, which results in the large variance of BQL. In the objective~(\ref{eq:2}), $\lambda$ controls the balance between performing the best possible operator and offsetting the overestimation caused by the operator. As shown in Figure~\ref{fig:mpe}, a large $\lambda$, \textit{e.g.}, $0.1$, weakens the strength of BQL, while a too small $\lambda$, \textit{e.g.}, $0$, causes severe overestimation and destroys the performance.


\subsection{Multi-Agent MuJoCo}

\begin{figure*}[!t]
	\centering
	\setlength{\abovecaptionskip}{5pt}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/2x3s.pdf}
		\caption{$2 \times 3$ Swimmer}		
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/2x4a.pdf}
		\caption{$2 \times 4$d Ant}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/6|2a.pdf}
		\caption{$6|2$ Ant}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/human.pdf}
		\caption{$17 \times 1$ Humanoid}	
            \label{fig:human}
	\end{subfigure}
	\caption{Learning curves on Multi-Agent MuJoCo.}
	\label{fig:mujoco}		

\end{figure*}

To evaluate BQL in \textit{partially observable} environments, we adopt Multi-Agent MuJoCo \citep{peng2021facmac}, where each agent independently controls one or some joints of the robot. In each task, we test four random seeds and plot the learning curves in Figure~\ref{fig:mujoco}. Here, we set $\lambda=0.5$. In the first three tasks, each agent can only observe the state of its own joints and bodies (with the parameter agent\_obsk = 0). BQL achieves higher reward or learns faster than the baselines, which verifies that BQL could be applied to partially observable environments. %In partially observable environments, BQL is performed on transition probabilities of observation $P_i(o_i'|o_i,a_i)$, which also depends on $\boldsymbol{\pi}_{-i}$. The convergence and optimality of BQL can only be guaranteed when one observation $o_i$ uniquely corresponds to one state $s$. 

In the first three tasks, we only consider two-agent cases in the partially observable setting, because such a limited observation range cannot support strong policies when there are more agents. We also test BQL on $17$-agent Humanoid with full observation in Figure~\ref{fig:human}. BQL obtains a significant performance gain in this many-agent task, which is evidence of the \textbf{good scalability} of BQL.


\subsection{SMAC and Google Research Football}

\begin{figure*}[!t]
	\centering
	\setlength{\abovecaptionskip}{5pt}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/2c_vs_64zg.pdf}
		\caption{2c\_vs\_64zg}		
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/2s3z.pdf}
		\caption{2s3z}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/3s5z.pdf}
		\caption{3s5z}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/1c3s5z.pdf}
		\caption{1c3s5z}	
	\end{subfigure}
	\caption{Learning curves on SMAC.}
	\label{fig:smac}		

\end{figure*}

\begin{figure*}[!t]
	\centering
	\setlength{\abovecaptionskip}{5pt}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/3v1.pdf}
		\caption{3\_vs\_1 with keeper}		
		\label{fig:grf1}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/attack.pdf}
		\caption{counterattack easy}
		\label{fig:grf2}	
	\end{subfigure}
	\hspace{-0.2cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/lamb-1.pdf}
		\caption{$2 \times 3$ Swimmer}	
		\label{fig:lamb1}	
	\end{subfigure}
	\hspace{-0.20cm}
	\begin{subfigure}[t]{0.25\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/smac_lamb.pdf}
		\caption{2c\_vs\_64zg}
		\label{fig:lamb2}	
	\end{subfigure}
	\caption{(a) and (b): Learning curves on GRF. (c) and (d): Learning curves with different $\lambda$.}		

\end{figure*}

We also perform experiments on \textit{partially observable and stochastic} SMAC tasks \citep{samvelyan19smac} with the version SC2.4.10, including both easy and hard maps \citep{yu2021surprising}. Agent numbers vary between $2$ and $9$. We build BQL on the implementation of PyMARL \citep{samvelyan19smac} and train the agents for four random seeds. The learning curves are shown in Figure~\ref{fig:smac}. In general, BQL outperforms the baselines, which verifies that BQL can also obtain performance gain in high-dimensional complex tasks. In 2c\_vs\_64zg, by considering the non-stationary transition probabilities, BQL and I2Q achieve significant improvement over other methods. We conjecture that the interplay between agents is strong in this task. 

GRF \citep{kurach2020google} is a physics-based 3D simulator where agents aim to master playing football. We select two academy tasks with sparse rewards: 3\_vs\_1 with keeper (3 agents) and counterattack easy (4 agents). We build BQL on the implementation of PyMARL2 \citep{hu2021rethinking} and train the agents for four random seeds. Although I2Q shows similar results to BQL in some SMAC tasks, BQL outperforms I2Q in GRF as shown in Figures~\ref{fig:grf1} and~\ref{fig:grf2}, because GRF is more stochastic than SMAC and the value gap of I2Q grows as stochasticity increases.

\subsection{Hyperparameter $\lambda$}

We further investigate the effect of $\lambda$ in Multi-Agent MuJoCo and SMAC. In the objective~(\ref{eq:2}), $\lambda$ controls the balance between performing the best possible operator and offsetting the overestimation caused by the operator. As shown in Figures~\ref{fig:lamb1} and~\ref{fig:lamb2}, a too large $\lambda$ weakens the strength of BQL. When $\lambda = 1.0$, BQL degenerates into IQL. A too small $\lambda$, \textit{e.g.}, $0$, causes overestimation. If the environment is more complex, \textit{e.g.}, SMAC, overestimation is more likely to occur, so we should set a large $\lambda$. In $2 \times 3$ Swimmer, BQL obtains a performance gain whenever $\lambda$ falls within the interval $[0.2,0.8]$, showing its robustness to $\lambda$.
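
The role of $\lambda$ can be made concrete with a one-step calculation. This is a sketch under simplifying assumptions: a single tabular entry, a scalar target $y$ treated as a constant, a step size $\alpha$ (notation introduced here), and the weighted squared-error form of the loss given in Appendix~\ref{appendix:hysteretic}. A gradient step on $w\,(Q_i(s,a_i) - y)^2$ gives
\[
Q_i(s,a_i) \leftarrow Q_i(s,a_i) + 2\alpha\, w\, \bigl(y - Q_i(s,a_i)\bigr), \qquad
w = \begin{cases}
1 & \text{if } y > Q_i(s,a_i) \\
\lambda & \text{otherwise,}
\end{cases}
\]
so the estimate moves upward with effective step size $2\alpha$ and downward with effective step size $2\alpha\lambda$. With $\lambda = 0$ the estimate never decreases, and with $\lambda = 1$ both directions are treated equally, so the update reduces to an ordinary squared-error regression toward $y$.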


\section{Conclusion}

We propose the \textit{best possible operator} and theoretically prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. We then simplify the operator and derive BQL, the first decentralized MARL algorithm that guarantees convergence to the global optimum in stochastic environments. Empirically, BQL outperforms baselines in a variety of multi-agent tasks.

\begin{acknowledgements} 
    This work was supported by NSFC under Grant 62476008. The authors would like to thank the anonymous reviewers for their valuable comments and advice.
\end{acknowledgements}

% References
\bibliography{uai2025-template}

\newpage

\onecolumn

\appendix
\section{Comparison with Hysteretic IQL}
\label{appendix:hysteretic}
Hysteretic IQL is a special case of BQL when the environment is deterministic. To illustrate this, we rewrite the loss function of BQL as
\begin{align*}
w(s, a_i)\left(Q_i\left(s, a_i\right)- \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ]  \right)^2, \\
w(s, a_i) = \left\{\begin{matrix}
1 & \text{if }  \mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ]   > Q_i\left(s, a_i\right)\\ 
\lambda & \text{else}.
\end{matrix}\right.
\setlength{\belowdisplayskip}{3pt}
\end{align*}
If $\lambda = 0$, the update of BQL is
\[Q_i(s,a_i) = \max\left(Q_i(s,a_i),\mathbb{E}_{\tilde{P}_i(s'|s,a_i)}\left [r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right ] \right).\]
Hysteretic IQL follows the loss function
\begin{align*}
w(s, a_i)\left(Q_i\left(s, a_i\right)- r - \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i})  \right)^2, \\
w(s, a_i) = \left\{\begin{matrix}
1 & \text{if }  r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i})  > Q_i\left(s, a_i\right)\\ 
\lambda & \text{else}.
\end{matrix}\right.
\setlength{\belowdisplayskip}{3pt}
\end{align*}
If $\lambda = 0$, Hysteretic IQL degenerates into Distributed IQL \citep{lauer2000algorithm}
\[Q_i(s,a_i) = \max\left(Q_i(s,a_i), r + \gamma \underset{a'_{i}}{\max}Q_i(s',a'_{i}) \right).\]
BQL takes the max of the expected target under the transition probability $\tilde{P}_i(s'|s,a_i)$, while Hysteretic IQL takes the max of the target on the sampled next state $s'$. When the environment is deterministic, they are equivalent. However, in stochastic environments, Hysteretic IQL cannot guarantee convergence to the global optimum since the environment will not always transition to the same $s'$. BQL can guarantee the global optimum in both deterministic and stochastic environments.
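
A minimal example, hypothetical and introduced only for illustration, makes the difference concrete. Suppose that for some $(s,a_i)$ there is a single possible transition distribution $\tilde{P}_i$: with probability $p \in (0,1)$ the next state is $s^{+}$ and otherwise it is $s^{-}$, with $r = 0$, $\max_{a_i'} Q_i(s^{+},a_i') = 1$, and $\max_{a_i'} Q_i(s^{-},a_i') = 0$, where the next-state values are held fixed. Starting from $Q_i(s,a_i) = 0$ with $\lambda = 0$, the two updates give
\begin{align*}
\text{Distributed IQL:}\quad & Q_i(s,a_i) = \max(0, \gamma \cdot 1) = \gamma \quad \text{once } s^{+} \text{ is sampled,} \\
\text{BQL:}\quad & Q_i(s,a_i) = \max\bigl(0,\; p\,\gamma \cdot 1 + (1-p)\,\gamma \cdot 0\bigr) = p\gamma .
\end{align*}
The sample-wise max therefore overestimates the expected value $p\gamma$ by $(1-p)\gamma$, and the gap vanishes only when $p = 1$, \textit{i.e.}, in the deterministic case.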


\section{Efficiency of BQL}
\label{appendix:buffer}
\begin{figure*}[h]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\includegraphics[width=0.4\linewidth]{figure/space.pdf}
	\caption{Space of other agents' policies $\boldsymbol{\pi}_{-i}$ given an $(s,a_i)$.}	
	\label{fig:space}	
	
\end{figure*}

\begin{figure*}[t]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\includegraphics[width=0.6\linewidth]{figure/buffer_num.pdf}
	\caption{Toy case for illustrating the ideal buffer number. $|\mathcal{S}| = 3$, $|\mathcal{A}_{i}|=2$, and $|\mathcal{A}_{-i}| =2$ corresponding to $P_i^1$ and $P_i^2$. We can see that any $P_i(s'|s,a_i)$ can be found in the 4 buffers.}	
	\label{fig:buff_num}
\end{figure*}

\begin{figure*}[t]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\includegraphics[width=0.4\linewidth]{figure/p5.pdf}
	\caption{Learning curves of BQL and joint Q-learning (JQL). BQL shows similar sample efficiency to JQL.}	
	\label{fig:jql}	
\end{figure*}

We now discuss the efficiency of collecting the replay buffer for BQL. The space of other agents' policies $\boldsymbol{\pi}_{-i}$ given an $(s,a_i)$ pair is a convex polytope. For clarity, Figure~\ref{fig:space} shows a triangular space. Each $\boldsymbol{\pi}_{-i}$ corresponds to a $P_i(s'|s,a_i)$. Deterministic policies $\boldsymbol{\pi}_{-i}$ are located at the vertices, while the edges and the interior of the polytope correspond to stochastic $\boldsymbol{\pi}_{-i}$, which are mixtures of deterministic ones. Since BQL only considers deterministic policies, the buffer series only needs to cover all the vertices by acting with deterministic policies in the collection of each buffer $\mathcal{D}_i^m$, which is efficient. BQL needs only $|\mathcal{A}_{i}| \times |\mathcal{A}_{-i}| = |\mathcal{A}|$ small buffers, a number independent of the state-space size $|\mathcal{S}|$, to meet the requirement of the simplified best possible operator that any possible $P_i(s'|s,a_i)$ can be found in one (ideally only one) buffer given an $(s,a_i)$ pair. More specifically, $|\mathcal{A}_{i}|$ buffers are needed to cover the action space, and $|\mathcal{A}_{-i}|$ buffers are needed to cover the transition space for each action. We intuitively illustrate this in Figure~\ref{fig:buff_num}. Each state in $\mathcal{D}_i^m$ requires $\#$ samples to estimate the expectation in (\ref{eq:bql-1}), so the sample complexity is $O(|\mathcal{A}||\mathcal{S}|\#)$. For joint Q-learning~(\ref{eq:joint_q}), the most efficient known method to guarantee convergence and optimality in stochastic environments, each state-joint action pair $(s,\boldsymbol{a})$ requires $\#$ samples to estimate the expectation, so the sample complexity is also $O(|\mathcal{A}||\mathcal{S}|\#)$. Thus, BQL is close to joint Q-learning in terms of sample complexity, which is empirically verified in Figure~\ref{fig:jql}.
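
The count can be spelled out as follows. This restates the argument above, using the stochastic-game sizes $|\mathcal{S}| = 30$, $|\mathcal{A}_i| = 4$, and $|\mathcal{A}_{-i}| = 4^3 = 64$ as an illustrative instance:
\begin{align*}
\text{BQL:}\quad & \underbrace{|\mathcal{A}_i|\,|\mathcal{A}_{-i}|}_{\text{buffers}} \times \underbrace{|\mathcal{S}|}_{\text{states per buffer}} \times\, \# = |\mathcal{A}|\,|\mathcal{S}|\,\# = 256 \times 30 \times \# = 7680\,\#, \\
\text{JQL:}\quad & \underbrace{|\mathcal{S}|\,|\mathcal{A}|}_{(s,\boldsymbol{a}) \text{ pairs}} \times\, \# = 30 \times 256 \times \# = 7680\,\# .
\end{align*}
Both counts ignore constant factors and assume that every required state is actually visited during collection.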


One may ask ``since you obtain all possible transition probabilities, why not perform IQL on each transition probability and choose the highest value?'' This naive algorithm can indeed learn the optimal policy, but its buffer collection is much more costly than that of BQL. The naive algorithm requires that any possible \textit{transition probability function over the whole state-action space} can be found in one buffer, which needs $|\mathcal{A}_{-i}|^{|\mathcal{S}|}$ buffers. Training IQL $|\mathcal{A}_{-i}|^{|\mathcal{S}|}$ times is also prohibitive. BQL only requires that any possible \textit{transition probability of any state-action pair} can be found in one buffer, which is much more efficient.
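
For a sense of scale, take the stochastic games of the main experiments as an illustrative instance ($|\mathcal{S}| = 30$, $|\mathcal{A}_{-i}| = 4^3 = 64$, $|\mathcal{A}| = 256$). The buffer counts stated above then become
\[
\underbrace{|\mathcal{A}_{-i}|^{|\mathcal{S}|} = 64^{30} = 2^{180} \approx 1.5 \times 10^{54}}_{\text{naive algorithm}}
\qquad \text{versus} \qquad
\underbrace{|\mathcal{A}| = 256}_{\text{BQL}} .
\]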

However, for the sake of sample efficiency, BQL with neural networks only maintains one replay buffer $\mathcal{D}_i$ containing all historical experiences, which is the same as IQL. $P_i$ in $\mathcal{D}_i$ corresponds to the average of other agents' historical policies, which is stochastic. Therefore, to guarantee optimality in theory, BQL with one buffer has to go through almost the whole $\boldsymbol{\pi}_{-i}$ space, which is costly. As shown in Figure~\ref{fig:matrix}, BQL- (with one buffer) outperforms IQL but does not match BQL (with the buffer series), showing that maintaining one buffer is costly but still effective. For the neural network instantiation, we show the results of BQL with the buffer series in Figure~\ref{fig:buffer}. Because of its lower sample efficiency, the buffer series cannot achieve strong performance, and maintaining one buffer like IQL is a better choice in complex environments.

\begin{figure*}[t]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/buffer-1.pdf}
		\caption{MPE, $\beta = 0.4$}		
	\end{subfigure}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/buffer-2.pdf}
		\caption{$2 \times 3$ Swimmer}	
	\end{subfigure}
	\caption{BQL with one buffer and buffer series.}
	\label{fig:buffer}		
	
\end{figure*}

\begin{figure*}[h]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/baseline-1.pdf}
		\caption{MPE, $\beta = 0.4$}	
	\end{subfigure}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/baseline-2.pdf}
		\caption{$2 \times 4$d Ant}	
	\end{subfigure}
	\caption{Learning curves of other base algorithms.}
	\label{fig:baseline}		
	
\end{figure*}

\section{Other Base Algorithms}

Besides DDPG, BQL can also be built on other variants of Q-learning, \textit{e.g.}, SAC. Figure~\ref{fig:baseline} shows that BQL also obtains a performance gain on independent SAC. Independent PPO (IPPO) \citep{de2020independent} is an on-policy decentralized MARL baseline. IPPO is not a Q-learning method, so it cannot be the base algorithm of BQL. On-policy algorithms do not reuse old experiences, which makes them less sample-efficient \citep{SpinningUp2018}, especially in fully decentralized settings, as shown in Figure~\ref{fig:baseline}. Thus, it is unfair to compare off-policy algorithms with on-policy algorithms.



\begin{figure*}[h]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\centering
		\raisebox{0.3\height}{\includegraphics[width=1\linewidth]{figure/game3.pdf}}
		\caption{matrix game}		
	\end{subfigure}
	\begin{subfigure}[t]{0.35\linewidth}
		\setlength{\abovecaptionskip}{0pt}
		\includegraphics[width=1\linewidth]{figure/coordination.pdf}
		\caption{learning curves}		
	\end{subfigure}
	\caption{Learning curves on a one-stage matrix game with multiple optimal joint policies.}
	\label{fig:coordination}
	
\end{figure*}

\section{Multiple Optimal Joint Policies}
\label{appendix:coordination}

We assume that there is only one optimal joint policy. With multiple optimal actions (with the max $Q_i(s,a_i)$), if each agent arbitrarily selects one of the optimal independent actions, the joint action might not be optimal. To address this, we use the simple technique proposed in I2Q \citep{jiangi2q}. Concretely, we set a performance tolerance $\varepsilon$ and introduce a fixed randomly initialized reward function $\hat{r}(s,s') \in (0,(1-\gamma)\varepsilon]$. Then all agents perform BQL to learn $\hat{Q}_i(s,a_i)$ of the shaped reward $r + \hat{r}$. Since $\hat{r}>0$, $\hat{Q}_i(s,a_i) > Q_i(s,a_i)$. In $\hat{Q}_i(s,a_i)$, the maximal contribution from $\hat{r}$ is $(1-\gamma)\varepsilon/(1-\gamma) = \varepsilon$, so the minimal contribution from $r$ is $\hat{Q}_i(s,a_i) - \varepsilon > Q_i(s,a_i) - \varepsilon$, which means that the maximal performance drop is $\varepsilon$ when selecting actions according to $\hat{Q}_i$. Because $\hat{r}(s,s')$ is randomly initialized, it is a small-probability event that multiple optimal joint policies exist under the reward function $r + \hat{r}$. Thus, if $\varepsilon$ is set to be small enough, BQL can solve tasks with multiple optimal joint policies. However, this technique is introduced only to relax the assumption behind the theoretical results. Empirically, it is not required, because there is usually only one optimal joint policy in complex environments. In all experiments, we do not use the randomly initialized reward function for BQL or the baselines, so the comparison is fair.
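
The bound on the contribution of $\hat{r}$ follows from a geometric series. Assuming an infinite horizon with discount factor $\gamma \in [0,1)$ and indexing timesteps by $t$ (notation introduced here),
\[
0 < \sum_{t=0}^{\infty} \gamma^{t}\,\hat{r}(s_t,s_{t+1}) \le \sum_{t=0}^{\infty} \gamma^{t}\,(1-\gamma)\,\varepsilon = \frac{(1-\gamma)\,\varepsilon}{1-\gamma} = \varepsilon .
\]
For example, with $\gamma = 0.99$ and a tolerance $\varepsilon = 0.1$ (values chosen only for illustration), each per-step bonus lies in $(0, 0.001]$.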

We test the randomly initialized reward function on a one-stage matrix game with two optimal joint policies $(1,2)$ and $(2,1)$, as shown in Figure~\ref{fig:coordination}. If the agents independently select actions, they might choose the miscoordinated joint policies $(1,1)$ and $(2,2)$. IQL cannot converge, but BQL agents always select coordinated actions, even though the value gap between the optimal and suboptimal policies is small, which verifies the effectiveness of the randomly initialized reward.

\begin{figure*}[h]
	\centering
	\setlength{\abovecaptionskip}{3pt}
	\includegraphics[width=0.3\linewidth]{figure/reward-density.pdf}
	\caption{Curves of reward and density of $l=\sqrt{\frac{2}{3}\sum_{i=1}^{3}x_i^2}$ in MPE. We plot the density of the uniform state distribution. There is only one global optimum, but the density of the local optimum is high. So decentralized agents will easily learn the locally optimal policies.}	
	\label{fig:density}	
\end{figure*}

\section{Hyperparameters}
\label{appendix:hyperparameters}

In MPE-based (MIT license) differential games, the relationship between $r$ and $l$ is visualized in Figure~\ref{fig:density}.

In $2 \times 3$ Swimmer, there are two agents and each of them controls $3$ joints of ManyAgent Swimmer. In $6|2$ Ant, there are two agents: one controls $6$ joints and the other controls $2$ joints. The other task names follow the same convention.

In MPE-based differential games and Multi-Agent MuJoCo (MIT license), we adopt the SpinningUp \citep{SpinningUp2018} implementation (MIT license), the SOTA implementation of DDPG, and follow all hyperparameters in SpinningUp. The discount factor is $\gamma = 0.99$, the learning rate is $0.001$ with the Adam optimizer, the batch size is $100$, the replay buffer contains $5 \times 10^5$ transitions, and the number of hidden units is $256$.

In SMAC (MIT license), we adopt the PyMARL \citep{samvelyan19smac} implementation and follow all hyperparameters in PyMARL (Apache-2.0 license). The discount factor is $\gamma = 0.99$, the learning rate is $0.0005$ with the RMSprop optimizer, the batch size is $32$ episodes, the replay buffer contains $5000$ episodes, and the number of hidden units is $64$. We adopt the version SC2.4.10 of SMAC.

In GRF (Apache-2.0 license), we adopt the PyMARL2 \citep{hu2021rethinking} implementation (Apache-2.0 license) and follow all hyperparameters in PyMARL2. The discount factor is $\gamma = 0.999$, the learning rate is $0.0005$ with the Adam optimizer, the batch size is $128$ episodes, the replay buffer contains $2000$ episodes, and the number of hidden units is $256$. We use the simple115 feature (a 115-dimensional vector summarizing many aspects of the game) as the observation instead of RGB images.

In MPE-based differential games, we set $\lambda=0.01$. In Multi-Agent MuJoCo, we set $\lambda=0.5$, and in SMAC, we set $\lambda=0.85$ for 2c\_vs\_64zg and $\lambda=0.8$ for other tasks. In GRF, we set $\lambda=0.1$ for 3\_vs\_1 with keeper and $\lambda=0.4$ for counterattack easy.

The experiments are carried out on an Intel i7-8700 CPU and an NVIDIA GTX 1080Ti GPU. The training of each MPE, MuJoCo, and GRF task can be finished in 5 hours, and the training of each SMAC task can be finished in 20 hours.

\end{document}
