%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

\usepackage{xr}


\externaldocument{mondal_582-supp}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{subcaption}
\newtheorem{assumption}{Assumption}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{example}{Example}
\newtheorem{remark}{Remark}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1, 2]{\href{mailto:<wmondal@purdue.edu>?Subject=Your UAI 2022 paper}{Washim~Uddin~Mondal}{}}
\author[1]{Vaneet Aggarwal}
\author[2]{Satish V. Ukkusuri}
% Add affiliations after the authors
\affil[1]{%
	School of Industrial Engineering\\
	Purdue University\\
	West Lafayette, Indiana, USA 47907
}
\affil[2]{%
	Lyles School of Civil Engineering\\
	Purdue University\\
	West Lafayette, Indiana, USA 47907
}

\begin{document}
	\maketitle
	
\begin{abstract}
	Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can approximate MARL well when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents interact uniformly with one another, which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-fields `seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of the state and action spaces, respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL problem with an error of $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
\end{abstract}
	
	\section{Introduction}\label{sec:intro}
	
	Multi-Agent Systems (MAS) are ubiquitous in the modern world. Many engineered systems such as transportation networks, power distribution and wireless communication systems can be modeled as MAS. Modeling, analysis and control of such systems to improve the overall performance is a central goal of research across multiple disciplines. Multi-Agent Reinforcement Learning (MARL) is a popular approach to achieve that target. In this article, we primarily focus on \textit{cooperative} MARL where the goal is to determine \textit{policies} for each individual agent such that the aggregate cumulative \textit{reward} of the entire population is maximized. However, the sizes of the joint state and action spaces of the population grow exponentially with the number of agents. This makes the computation of the solution prohibitively hard for large MAS. 
	
	Two major computationally efficient approaches have been developed to tackle this problem. The first approach restricts its attention to local policies. In other words, it is assumed that each individual agent makes its decision solely based on its local state/observation. Algorithms that fall into this category are independent Q-learning (IQL) \citep{tan1993multi}, centralised training and decentralised execution (CTDE) based algorithms such as VDN \citep{sunehag2017value}, QMIX \citep{rashid2018qmix}, WQMIX \citep{rashid2020weighted}, etc. Unfortunately, none of these algorithms can provide theoretical convergence guarantees. The other approach is called mean-field control (MFC) \citep{angiuli2022unified}. It is grounded on the idea that in an infinite population of homogeneous agents, it is sufficient to study the behaviour of only one representative agent in order to draw accurate conclusions about the whole population. Recent studies have shown that, if the agents are exchangeable, then MFC can be proven to be a good approximation of MARL \citep{gu2021mean}.
	
	Unfortunately, the idea of exchangeability essentially states that all agents in a population interact uniformly with each other (uniform means that all pairwise interactions are the same). This is not true in many practical scenarios. For example, in a traffic control network, the congestion at an intersection is highly influenced by the control policies adopted at its immediate neighbouring intersections. Moreover, the influence of one intersection on another rapidly diminishes as their separation distance increases. Non-uniform interaction is a hallmark characteristic of many other MASs such as social networks, wireless networks, etc. In the absence of uniformity of the interaction between the agents, the framework of MFC no longer applies, and the problem becomes challenging. In this paper, we present a new result which assures that, even with non-uniform interactions, MFC is a good choice for approximating MARL if the reward of each agent is an affine function of the mean-field distributions `seen' by that agent. We note that the behaviour of agents in a multitude of social and economic networks can be modeled via affine rewards (refer to the examples given in \citep{chen2021agent}), and thus for many cases of practical interest, MFC can approximate MARL with non-uniform interactions. %under certain widely applicable assumptions. 
	
	
	
	\subsection{Contributions}
	\label{sec:contributions}
	
	We consider a non-uniform MARL setup where the pairwise interaction between the agents is described by an arbitrary doubly stochastic matrix (DSM). As a result of non-uniform interaction, the so-called mean-field effect of the population on an agent is determined by the identity of the agent. This is in stark contrast with other existing works \citep{gu2021mean, mondal2021approximation} where the presumption of exchangeability washes away the dependence on identity.	
	 We demonstrate that, if the reward of each agent is an affine function of the mean-field distribution `seen' by that agent, then the standard MFC approach can approximate the non-uniform MARL with an error bound of $e\triangleq\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}])$, where $N$ is the number of agents and $|\mathcal{X}|, |\mathcal{U}|$ indicate the sizes of the state and action spaces of an individual agent.
	
	We would like to emphasize the importance of this result. MFC is traditionally seen as an approximation method of MARL when the agents are exchangeable and hence their interactions are uniform. Uniformity allows us to solve MFC problems by tracking only one representative agent. In this paper, we show that, under certain conditions, a non-uniform MARL can also be approximated by the MFC approach. Thus, although the non-uniform interaction is a major part of the original MARL problem, the assumed affine structure of the reward function allows us to evade non-uniformity while obtaining an approximate solution. The key result is established in Lemma \ref{lemma_7} (Appendix \ref{appndx:approx}) where, using the affine structure of the reward function, we show that the instantaneous reward generated from non-uniform MARL can be closely approximated by MFC-generated instantaneous reward.
	
	Finally, using the results of \citep{liu2020improved}, in Section \ref{sec:npg}, we design a natural policy-gradient based algorithm that can solve MFC within an error of $\mathcal{O}(\epsilon)$ for any $\epsilon >0$, with a sample complexity of $\mathcal{O}(\epsilon^{-3})$. Invoking our approximation result, we prove that the devised algorithm can yield a solution that is within an error of $\mathcal{O}(\max\{e, \epsilon\})$ of the optimal MARL solution, with a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon>0$.
	
	\subsection{Related Works}
	\label{sec:related_works}
	{\bf Single Agent RL:} The classical algorithms in single agent learning include tabular Q-learning \citep{watkins1992q}, SARSA \citep{rummery1994line}, etc. Although they provide theoretical guarantees, these algorithms can only be applied to systems with small state-action spaces due to their large memory requirements. Recently, Neural Network (NN) based $Q$-iteration \citep{mnih2015human} and policy gradient \citep{mnih2016asynchronous} algorithms have become popular due to the large expressive power of NNs. However, they cannot be applied to large MAS due to the exponential blow-up of the joint state space. %Additionally, these algorithms do not provide any theoretical guarantee.
	
	{\bf MFC as an Approximation to Uniform MARL:} MFC has recently been gaining traction as a scalable approximate solution to uniform MARL. On the theory side, it has been proven that MFC can approximate uniform MARL within an error of $\mathcal{O}(1/\sqrt{N})$ \citep{gu2021mean}. However, the result relies on the assumption that all agents are homogeneous. Later, this approximation result was extended to heterogeneous agents \citep{mondal2021approximation}. We would like to clarify that the idea of heterogeneity is different from the idea of non-uniformity. In the first case, the agents are divided into multiple classes. However, the identities of different agents within a given class are irrelevant. In contrast, non-uniform interaction takes the identity of each agent into account.
	
	{\bf Graphon Approximation:} One possible approach to consider non-uniform agent interaction is the notion of Graphon mean-field, which has recently been gaining popularity in the \textit{non-cooperative} MARL setup \citep{caines2019graphon,cui2021learning}. The main idea is to approximate the finite indices of the agents as a continuum of real numbers and the discrete interaction graph between agents as a continuous, symmetric, measurable function, called a graphon, in the asymptotic limit of infinite population. The unfortunate consequence of this approximation is that one is left to deal with an infinite dimensional mean-field distribution. In order to obtain a practical solution from the graphon approximation, one must therefore discretise the continuum of agent indices \citep{cui2021learning}, which limits the use of this approximation. Our paper establishes that, for affine reward functions, we do not need to resort to the complexity of the Graphon approximation.  
	
	{\bf Applications of MFC:} Alongside the theory, MFC has also become popular as an application tool. It has been used in ride-sharing \citep{al2019deeppool}, epidemic management \citep{watkins2016optimal}, congestion control in road networks \citep{wang2020large}, etc.
	
	{\bf Learning Algorithms for MFC:} Both model-free \citep{angiuli2022unified, gu2021mean} and model-based \citep{pasztor2021efficient} Q-learning algorithms have been proposed in the literature to solve uniform MARL via MFC with homogeneous agents. Recently, \citet{mondal2021approximation} proposed a policy-gradient algorithm for heterogeneous-MFC.
	
	

	
	
	
	%This circular exercise thus raises questions about the benefits of graphon mean-field approach.
	\section{Cooperative MARL with Non-Uniform Interaction}\label{sec:cooperativeMARL}
	We consider a system comprising $N$ interacting agents. The (finite) state and action spaces of each agent are denoted as $\mathcal{X}$ and $\mathcal{U}$, respectively. Time is assumed to belong to the discrete set, $\mathbb{T}\triangleq\{0,1,2,\cdots\}$. The state and action of the $i$-th agent at time $t$ are symbolized as $x_t^i$ and $u_t^i$. The empirical state and action distributions of the population of agents at time $t$ are denoted by $\boldsymbol{\mu}_t^N$ and $\boldsymbol{\nu}_t^N$, respectively, and defined as follows.
	\begin{align}
		\boldsymbol{\mu}_t^N(x) \triangleq \dfrac{1}{N}\sum_{i=1}^N \delta(x_t^i=x), ~\forall x\in \mathcal{X},\forall t\in\mathbb{T}
		\label{mu}
		\\
		\boldsymbol{\nu}_t^N(u) \triangleq \dfrac{1}{N} \sum_{i=1}^N \delta(u_t^i=u), ~\forall u\in \mathcal{U}, \forall t\in\mathbb{T}
		\label{nu}
	\end{align}
	where $\delta(\cdot)$ is the indicator function.
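
	As a small illustration of $(\ref{mu})$, with hypothetical numbers chosen purely for exposition, take $N=4$, $\mathcal{X}=\{1,2\}$ and states $(x_t^1, x_t^2, x_t^3, x_t^4)=(1,1,2,1)$. Counting the agents in each state gives
	\begin{align*}
		\boldsymbol{\mu}_t^N(1) &= \dfrac{1}{4}(1+1+0+1) = \dfrac{3}{4},\\
		\boldsymbol{\mu}_t^N(2) &= \dfrac{1}{4}(0+0+1+0) = \dfrac{1}{4}
	\end{align*}
	so that $\boldsymbol{\mu}_t^N$ sums to one over $\mathcal{X}$. The action distribution $\boldsymbol{\nu}_t^N$ is obtained from $(\ref{nu})$ in the same way by counting actions instead of states.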
	
	Each agent, $i\in [N] \triangleq\{1,\cdots, N\}$ is endowed with a reward function $r$ and a state transition function $P$ that are of the following forms: $r:\mathcal{X}\times \mathcal{U} \times \mathcal{P}(\mathcal{X})\times \mathcal{P}(\mathcal{U})\rightarrow \mathbb{R}$ and $P:\mathcal{X}\times \mathcal{U} \times \mathcal{P}(\mathcal{X})\times \mathcal{P}(\mathcal{U})\rightarrow \mathcal{P}(\mathcal{X})$ where $\mathcal{P}(\cdot)$ is the set of all Borel probability measures over its argument. In particular, $r, P$ take the following as arguments: (a) the state, $x_t^i$ and action, $u_t^i$ of the corresponding agent and (b) the weighted state distribution $\boldsymbol{\mu}_t^{i,N}$ and the weighted action distribution $\boldsymbol{\nu}_t^{i,N}$ of the population as seen from the perspective of the agent. The terms $\boldsymbol{\mu}_t^{i,N}$ and $\boldsymbol{\nu}_t^{i,N}$ are defined as follows. 
	\begin{align}
		\boldsymbol{\mu}_t^{i,N}(x) \triangleq \sum_{j=1}^N W(i,j)\delta(x_t^j=x), ~\forall x\in \mathcal{X},\forall t\in \mathbb{T}
		\label{mu_i}
		\\
		\boldsymbol{\nu}_t^{i,N}(u) \triangleq \sum_{j=1}^N W(i,j)\delta(u_t^j=u), ~\forall u\in \mathcal{U}, \forall t\in \mathbb{T}
		\label{nu_i}
	\end{align}
	
	The function $W:[N]\times [N]\rightarrow [0,1]$ dictates the influence of one agent on another. In particular, $W(i, j)$ specifies how the $j$-th agent influences the $i$-th agent's reward and transition functions. Observe that, for $\boldsymbol{\mu}_t^{i,N}$ and $\boldsymbol{\nu}_t^{i,N}$ to be probability distributions, $W$ must be right-stochastic, i.e.,
	\begin{align}
		\sum_{j=1}^N W(i,j) = 1, ~\forall i\in \{1,\cdots,N\}
		\label{W_sum_j}
	\end{align}
	
	In summary, the reward received by the $i$-th agent at time $t$ can be expressed as $r(x_t^i, u_t^i, \boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N})$. Moreover, the state of the agent at time $t+1$ is decided by the following probability law: $x_{t+1}^i\sim P(x_t^i, u_t^i, \boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N})$. We would like to point out that, in contrast to our framework, existing works assume reward and state transition to be functions of $\boldsymbol{\mu}_t^N, \boldsymbol{\nu}_t^N$, thereby making the influence of population to be identical for every agent \citep{mondal2021approximation, gu2021mean}. If we take the influence function $W$ to be uniform i.e., $W(i, j)= 1/N$, $\forall i, j \in [N]$, then $\forall i\in [N]$, $\boldsymbol{\mu}_t^{i,N}=\boldsymbol{\mu}_t^N$, and $\boldsymbol{\nu}_t^{i,N}=\boldsymbol{\nu}_t^N$,  which forces our framework to collapse onto that described in the above mentioned papers.
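
	To make the contrast with the uniform case concrete, consider a purely illustrative instance (the numbers are hypothetical) with $N=3$, $\mathcal{X}=\{0,1\}$, states $(x_t^1,x_t^2,x_t^3)=(0,1,1)$, and
	\begin{align*}
		W = \begin{bmatrix}
			1/2 & 1/2 & 0\\
			1/2 & 0 & 1/2\\
			0 & 1/2 & 1/2
		\end{bmatrix}
	\end{align*}
	whose rows sum to one, as required by $(\ref{W_sum_j})$. Applying $(\ref{mu_i})$ to the state $x=1$ yields
	\begin{align*}
		\boldsymbol{\mu}_t^{1,N}(1) &= W(1,2) + W(1,3) = \dfrac{1}{2},\\
		\boldsymbol{\mu}_t^{2,N}(1) &= W(2,2) + W(2,3) = \dfrac{1}{2},\\
		\boldsymbol{\mu}_t^{3,N}(1) &= W(3,2) + W(3,3) = 1
	\end{align*}
	whereas $(\ref{mu})$ gives $\boldsymbol{\mu}_t^N(1)=2/3$. The third agent, which places no weight on the first agent, therefore `sees' a population that is entirely in state $1$. With the uniform choice $W(i,j)=1/3$, all three agents would instead see $\boldsymbol{\mu}_t^{i,N}(1)=2/3$.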
	
	At time $t\in \mathbb{T}$, each agent is also presumed to have a policy function $\pi_t: \mathcal{X} \times \mathcal{P}(\mathcal{X})\rightarrow \mathcal{P}(\mathcal{U})$ that maps $(x_t^i, \boldsymbol{\mu}_t^{i,N})$ to a distribution over the action space, $\mathcal{U}$. In simple words, a policy function $\pi_t$ is a rule that (probabilistically) dictates what action must be chosen by an agent given its current state and the mean-distribution of the population as observed by the agent. Note that the policy function is presumed to be the same for all the agents because the reward function, $r$, and the transition function, $P$, are taken to be homogeneous across the population. Homogeneity of $r, P$ is a common assumption in the mean-field literature \citep{gu2021mean, vasal2021sequential}.
	
	For a given set of initial states $\boldsymbol{x}_0\triangleq \{x_0^i\}_{i\in [N]}$, the value of the sequence of policies, $\boldsymbol{\pi}\triangleq\{\pi_t\}_{t\in \mathbb{T}}$, for the $i$-th agent is defined as follows.
	\begin{align}
		v_i(\boldsymbol{x}_0, \boldsymbol{\pi}) \triangleq \sum_{t\in \mathbb{T}} \gamma^t \mathbb{E}\left[ r\left(x_t^i, u_t^i, \boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N}\right) \right]
	\end{align}
	where $\boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N}$ are defined by $(\ref{mu_i}), (\ref{nu_i})$, respectively, and the expectation is computed over all the state-action trajectories generated by the transition function $P$ and the sequence of policy functions, $\boldsymbol{\pi}$. The term, $\gamma\in [0, 1)$ is called the time discount factor. We would like to emphasize that the value function $v_i$ is dependent on the interaction matrix $W$ (because $\boldsymbol{\mu}_t^{i,N}$ and $\boldsymbol{\nu}_t^{i,N}$ are). However, such dependence is not explicitly shown to keep the notation uncluttered. The average value function of the entire population is expressed as below.
	\begin{align}
		v_{\mathrm{MARL}}(\boldsymbol{x}_0, \boldsymbol{\pi}) = \dfrac{1}{N} \sum_{i=1}^N v_i(\boldsymbol{x}_0, \boldsymbol{\pi})
		\label{v_MARL}
	\end{align}  
	
	The goal of MARL is to maximize $v_{\mathrm{MARL}}(\boldsymbol{x}_0, \cdot)$ over all policy sequences $\boldsymbol{\pi}$. Such an optimization is hard to solve in general, especially for large $N$.
	
	Before concluding this section, we would like to point out two important observations that will be extensively used in many of our forthcoming results.
	
	\begin{remark} $\forall t\in \mathbb{T}$, the random variables $\{u_t^i\}_{i\in[N]}$ are conditionally independent given $\{x_t^i\}_{i\in[N]}$. In other words, given the current states, each agent chooses its action independently of the others.
	\end{remark}
	
	\begin{remark}
		$\forall t\in \mathbb{T}$, the random variables $\{x_{t+1}^i\}_{i\in[N]}$ are conditionally independent given $\{x_t^i\}_{i\in[N]}$ and $\{u_t^i\}_{i\in[N]}$. In other words, given the current states and actions, the next state of each agent evolves independently of the others.
	\end{remark}
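
	Written out, the two remarks amount to the following factorizations of the joint conditional laws. The shorthands $\boldsymbol{x}_t\triangleq\{x_t^i\}_{i\in[N]}$, $\boldsymbol{u}_t\triangleq\{u_t^i\}_{i\in[N]}$, the symbol $\mathbb{P}$ for the probability of an event, and the notation $P(\cdot)(x)$ for the mass that the distribution $P(\cdot)$ assigns to $x$ are introduced only for this display.
	\begin{align*}
		&\mathbb{P}\left(\boldsymbol{u}_t \,|\, \boldsymbol{x}_t\right) = \prod_{i=1}^N \pi_t(x_t^i, \boldsymbol{\mu}_t^{i,N})(u_t^i),\\
		&\mathbb{P}\left(\boldsymbol{x}_{t+1} \,|\, \boldsymbol{x}_t, \boldsymbol{u}_t\right) \\
		&\hspace{0.5cm}= \prod_{i=1}^N P(x_t^i, u_t^i, \boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N})(x_{t+1}^i)
	\end{align*}
	Note that $\boldsymbol{\mu}_t^{i,N}$ is fixed once $\boldsymbol{x}_t$ is given, and $\boldsymbol{\nu}_t^{i,N}$ is fixed once $\boldsymbol{u}_t$ is given, so each factor on the right-hand side is a deterministic function of the conditioning variables.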
	
	\section{Mean-Field Control}
	\label{sec:MFC}
	
	MFC is an approximation method for $N$-agent MARL that takes away many of the complexities of the latter. The main idea of MFC is to consider an infinite population of homogeneous agents, instead of a finite population as considered in MARL. The advantage of such a presumption is that it allows us to draw accurate inferences about the whole population by tracking only a single representative agent. Unfortunately, as stated before, such an approximation method is known to work \citep{gu2021mean} when the interactions between different agents are uniform, i.e., $W(i, j)=1/N$, $\forall i,j \in [N]$. In this article, we shall show that, under certain conditions, MFC can serve as an approximation of MARL even with non-uniform $W$. Below we describe the MFC method.
	
	As explained above, in MFC, we only need to track a single representative agent. Let the state and action of the agent at time $t$ be denoted as $x_t$, and $u_t$ respectively. Also, let $\boldsymbol{\mu}_t$, $\boldsymbol{\nu}_t$ be the state, and action distributions of the infinite population at time $t$. The reward and state transition laws of the representative at time $t$ are denoted as $r(x_t,u_t,\boldsymbol{\mu}_t,\boldsymbol{\nu}_t)$ and $P(x_t,u_t,\boldsymbol{\mu}_t,\boldsymbol{\nu}_t)$, respectively. For a given policy sequence $\boldsymbol{\pi}\triangleq \{\pi_t\}_{t\in \mathbb{T}}$, the action distribution $\boldsymbol{\nu}_t$ can be expressed as a deterministic function of the state distribution, $\boldsymbol{\mu}_t$ as follows.
	\begin{align}
		\boldsymbol{\nu}_t = \nu^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t) \triangleq \sum_{x\in \mathcal{X}} \pi_t(x, \boldsymbol{\mu}_t)\boldsymbol{\mu}_t(x)
		\label{nu_MF}
	\end{align}
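
	For instance, with hypothetical numbers chosen only for illustration, let $\mathcal{X}=\mathcal{U}=\{0,1\}$, $\boldsymbol{\mu}_t=(3/4, 1/4)$, and suppose the policy prescribes $\pi_t(0,\boldsymbol{\mu}_t)=(1/3, 2/3)$ and $\pi_t(1,\boldsymbol{\mu}_t)=(1, 0)$, where each pair lists the probabilities of the actions $u=0$ and $u=1$. Then $(\ref{nu_MF})$ gives
	\begin{align*}
		\boldsymbol{\nu}_t(0) &= \dfrac{1}{3}\cdot\dfrac{3}{4} + 1\cdot\dfrac{1}{4} = \dfrac{1}{2},\\
		\boldsymbol{\nu}_t(1) &= \dfrac{2}{3}\cdot\dfrac{3}{4} + 0\cdot\dfrac{1}{4} = \dfrac{1}{2}
	\end{align*}
	so the action distribution is simply the mixture of the per-state action distributions weighted by $\boldsymbol{\mu}_t$.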
	
	In a similar fashion, the state distribution at time $t+1$ can be written as a deterministic function of $\boldsymbol{\mu}_t$ as follows.
	\begin{align}
		\begin{split}
			\boldsymbol{\mu}_{t+1} &=  P^{\mathrm{MF}}(\boldsymbol{\mu}_t,\pi_t)\\ 
			&\triangleq 
			\sum_{x\in \mathcal{X}}\sum_{u\in\mathcal{U}} P(x, u, \boldsymbol{\mu}_t, \nu^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t))\\
			& \hspace{2cm}\times\pi_t(x, \boldsymbol{\mu}_t)(u)\boldsymbol{\mu}_t(x)
		\end{split}
		\label{mu_t_plus_1}
	\end{align}
	
	For an initial state distribution $\boldsymbol{\mu}_0$, the value of a sequence of policies $\boldsymbol{\pi}\triangleq \{\pi_t\}_{t\in \mathbb{T}}$, is defined as written below.
	\begin{align}
		\begin{split}
			&v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \boldsymbol{\pi}) \triangleq \sum_{t\in \mathbb{T}} \gamma^t r^{\mathrm{MF}}\left(\boldsymbol{\mu}_t, \pi_t\right),\\
			\text{where}~& r^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t) \triangleq \sum_{x\in \mathcal{X}}\sum_{u\in\mathcal{U}} r(x, u, \boldsymbol{\mu}_t, \nu^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t))\\
			&\hspace{2.5cm}\times\pi_t(x, \boldsymbol{\mu}_t)(u)\boldsymbol{\mu}_t(x)
		\end{split}
		\label{v_MF}
	\end{align}
	
	The term $r^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t)$ indicates the average reward of the population. Alternatively, it can also be expressed as the ensemble average of the reward of the representative agent i.e., $r^{\mathrm{MF}}(\boldsymbol{\mu}_t, \pi_t) = \mathbb{E}[r(x_t, u_t, \boldsymbol{\mu}_t, \boldsymbol{\nu}_t)]$ where the expectation is computed over all possible states $x_t\sim \boldsymbol{\mu}_t$, and actions $u_t\sim \pi_t(x_t,\boldsymbol{\mu}_t)$ at time $t$. The mean distributions $\boldsymbol{\mu}_t, \boldsymbol{\nu}_t$ are sequentially determined by $(\ref{nu_MF}), (\ref{mu_t_plus_1})$ from a given initial state distribution, $\boldsymbol{\mu}_0$.
	
	
	The goal of MFC is to maximize $v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \cdot)$ over all policy sequences. In the next section, we shall demonstrate that, under certain conditions, $v_{\mathrm{MARL}}$ is well-approximated by $v_{\mathrm{MF}}$. Therefore, in order to solve MARL, it is sufficient to solve its associated MFC.
	
	It is worthwhile to point out that $\boldsymbol{\mu}_t, \boldsymbol{\nu}_t$ can be thought of as limiting values of the empirical distributions $\boldsymbol{\mu}_t^N, \boldsymbol{\nu}_t^N$ in the asymptotic limit of infinite population. Note that $\boldsymbol{\mu}_t^N, \boldsymbol{\nu}_t^N$ and thereby $\boldsymbol{\mu}_t, \boldsymbol{\nu}_t$ are \textit{not} dependent on $W$. This makes the MFC problem agnostic of $W$. In contrast, agents in the $N$-agent MARL problem are influenced by the weighted mean-field distributions $\{\boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N}\}_{i\in[N]}$ which do depend on $W$ via $(\ref{mu_i}), (\ref{nu_i})$.  Therefore, unlike in the existing works, the mean-field representative in our case cannot be described as a randomly chosen typical agent in the limit $N\rightarrow \infty$. The concept of a mean-field representative, in our work, is a useful construct that, under certain conditions, can provide a good approximate solution to MARL. 
	
	In the next section, we describe how these seemingly incompatible frameworks, namely non-uniform MARL where the behaviour of agents is dependent on $W$, and the framework of $W$-agnostic MFC, can be merged together.
	
	\section{MFC as an approximation to Non-Uniform MARL}
	Before formally stating our main result, we would like to describe the assumptions that the result is grounded upon. Our first assumption is on the structure of the state-transition function.
	
	\begin{assumption}
		The state-transition function $P$ is Lipschitz continuous with parameter $L_P$ with respect to the mean-distribution arguments. Mathematically, the inequality, 
		\begin{align*}
			|P(x, u, &\boldsymbol{\mu}_1, \boldsymbol{\nu}_1) - P(x, u, \boldsymbol{\mu}_2, \boldsymbol{\nu}_2)|_1 \\
			&\leq L_P[|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2|_1 + |\boldsymbol{\nu}_1-\boldsymbol{\nu}_2|_1 ]
		\end{align*}
		holds $\forall x \in \mathcal{X}, \forall u\in \mathcal{U}$, $\forall \boldsymbol{\mu}_1,\boldsymbol{\mu}_2 \in \mathcal{P}(\mathcal{X})$ and $\forall \boldsymbol{\nu}_1, \boldsymbol{\nu}_2\in \mathcal{P}(\mathcal{U})$. The symbol $|\cdot|_1$ denotes $L_1$ norm.
		\label{assumption_1}
	\end{assumption}
	
	Assumption \ref{assumption_1} states that the transition function, $P$, is Lipschitz continuous with respect to its mean-field arguments. Essentially, this implies that if the state-distribution changes from $\boldsymbol{\mu}$ to $\boldsymbol{\mu} + \Delta \boldsymbol{\mu}$, then the corresponding change in the transition-function can be bounded by a term proportional to $|\Delta \boldsymbol{\mu}|_1$. A similar property holds for a change in the action-distribution. This useful assumption commonly appears in the mean-field literature \citep{gu2021mean, mondal2021approximation, carmona2018probabilistic}.
	
	The second assumption is on the structure of $r$, the reward function.
	\begin{assumption}
		The reward function, $r$ is affine with respect to  mean-distribution arguments. Mathematically, for some $\boldsymbol{a}\in \mathbb{R}^{|\mathcal{X}|}, \boldsymbol{b}\in\mathbb{R}^{|\mathcal{U}|}$, and $f:\mathcal{X}\times \mathcal{U}\rightarrow \mathbb{R}$, the equality,
		\begin{align*}
			r(x, u, \boldsymbol{\mu}, \boldsymbol{\nu}) = \boldsymbol{a}^T\boldsymbol{\mu} + \boldsymbol{b}^T\boldsymbol{\nu} + f(x, u)
		\end{align*}
		holds $\forall x\in \mathcal{X}$, $\forall u\in \mathcal{U}$, $\forall \boldsymbol{\mu}\in \mathcal{P}(\mathcal{X})$, and $\forall \boldsymbol{\nu}\in \mathcal{P}(\mathcal{U})$. 
		\label{assumption_2}
	\end{assumption}
	
	Assumption \ref{assumption_2} dictates that the reward is an affine function of the mean-field distributions. Although this assumption does not allow us to encapsulate a large variety of reward functions, we would like to point out that the behaviour of agents in a multitude of social and economic networks can be modeled via affine rewards (refer to the examples given in \citep{chen2021agent}). We shall provide one explicit example at the end of this section. We would also like to reiterate that the benefit of this seemingly restrictive assumption of affine reward is that it allows us to apply the principles of MFC to an arbitrarily interacting $N$-agent system, which is notoriously complex to solve in general.
	
	An immediate corollary of Assumption \ref{assumption_2} is that the reward function is bounded and Lipschitz continuous. The formal statement is given below.
	
	\begin{corollary}
		If the reward function, $r$ satisfies Assumption $\ref{assumption_2}$, then for some $M_R, L_R>0$, the following holds
		\begin{align*}
			(a) &|r(x, u, \boldsymbol{\mu}_1, \boldsymbol{\nu}_1)|\leq M_R, \\ 
			(b) &|r(x, u, \boldsymbol{\mu}_1, \boldsymbol{\nu}_1) - r(x, u, \boldsymbol{\mu}_2, \boldsymbol{\nu}_2)| \\
			&\hspace{2cm} \leq 	L_R\left[|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2|_1 + |\boldsymbol{\nu}_1-\boldsymbol{\nu}_2|_1\right]
		\end{align*}	
		$\forall x\in \mathcal{X}$, $\forall u\in \mathcal{U}$, $\forall \boldsymbol{\mu}_1,\boldsymbol{\mu}_2\in \mathcal{P}(\mathcal{X})$, and $\forall \boldsymbol{\nu}_1, \boldsymbol{\nu}_2\in \mathcal{P}(\mathcal{U})$. 
		\label{corollary_1}
	\end{corollary}
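
	One admissible choice of the constants can be read off directly from Assumption \ref{assumption_2}. The following is a sketch, and these are not necessarily the constants used in our proofs. The notation $|\cdot|_{\infty}$ for the largest absolute entry of a vector and the definition $M_F\triangleq \max_{x, u}|f(x,u)|$, which is finite because $\mathcal{X}$, $\mathcal{U}$ are finite, are introduced for this display. Since $|\boldsymbol{\mu}|_1=|\boldsymbol{\nu}|_1=1$ for probability distributions, H\"older's inequality yields
	\begin{align*}
		|r(x, u, \boldsymbol{\mu}, \boldsymbol{\nu})| &\leq |\boldsymbol{a}|_{\infty}|\boldsymbol{\mu}|_1 + |\boldsymbol{b}|_{\infty}|\boldsymbol{\nu}|_1 + M_F\\
		&= |\boldsymbol{a}|_{\infty} + |\boldsymbol{b}|_{\infty} + M_F
	\end{align*}
	Similarly, as the term $f(x,u)$ cancels in the difference,
	\begin{align*}
		&|r(x, u, \boldsymbol{\mu}_1, \boldsymbol{\nu}_1) - r(x, u, \boldsymbol{\mu}_2, \boldsymbol{\nu}_2)|\\
		&= |\boldsymbol{a}^T(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2) + \boldsymbol{b}^T(\boldsymbol{\nu}_1-\boldsymbol{\nu}_2)|\\
		&\leq |\boldsymbol{a}|_{\infty}|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2|_1 + |\boldsymbol{b}|_{\infty}|\boldsymbol{\nu}_1-\boldsymbol{\nu}_2|_1
	\end{align*}
	Hence any $M_R\geq |\boldsymbol{a}|_{\infty} + |\boldsymbol{b}|_{\infty} + M_F$ and any $L_R\geq \max\{|\boldsymbol{a}|_{\infty}, |\boldsymbol{b}|_{\infty}\}$ with $M_R, L_R>0$ satisfy (a) and (b).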
	
	
	
	The third assumption concerns the set of allowable policy functions.
	\begin{assumption}
		The set of allowable policy functions, $\Pi$, is such that each of its elements is Lipschitz continuous with respect to its mean-state distribution argument. Mathematically, $\forall \pi\in \Pi$, the following inequality holds
		\begin{align*}
			|\pi(x, \boldsymbol{\mu}_1) - \pi(x, \boldsymbol{\mu}_2)|_1 \leq L_Q|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2|_1
		\end{align*}
		for some $L_Q>0$ and $\forall x\in \mathcal{X}$, $\forall \boldsymbol{\mu}_1, \boldsymbol{\mu}_2\in \mathcal{P}(\mathcal{X})$.
		\label{assumption_3}
	\end{assumption}
	
	Assumption \ref{assumption_3} states that the allowable policy functions must be Lipschitz continuous with respect to their state-distribution argument. Such a requirement typically holds for neural network based policies and is commonly presumed to be true in the literature \citep{gu2021mean, cui2021learning, pasztor2021efficient}. 
	
	The final assumption imposes some constraints on the interaction function, $W$.
	\begin{assumption}
		The interaction function, $W$ is such that, 
		\begin{align}
			\sum_{i=1}^N W(i, j) = 1, ~\forall j \in \{1, \cdots, N\}
		\end{align}
		In conjunction with $(\ref{W_sum_j})$, this assumption implies that $W$ is doubly-stochastic.
		\label{assumption_4}
	\end{assumption}
	
	Assumption $\ref{assumption_4}$ requires $W$ to be an $N\times N$ doubly stochastic matrix (DSM). Such a presumption is commonly applied in many multi-agent tasks, e.g., distributed consensus \citep{alaviani2019distributedacc}, distributed optimization \citep{alaviani2019distributed}, and multi-agent learning \citep{wai2018multi}.
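
	The role played by the column-sum condition can be seen from a short calculation, which we give as an informal sketch of the intuition rather than as a substitute for the formal argument in the appendix. Averaging $(\ref{mu_i})$ over the agents and exchanging the order of summation, Assumption \ref{assumption_4} gives, $\forall x\in\mathcal{X}$,
	\begin{align*}
		\dfrac{1}{N}\sum_{i=1}^N \boldsymbol{\mu}_t^{i,N}(x) &= \dfrac{1}{N}\sum_{j=1}^N \left[\sum_{i=1}^N W(i,j)\right]\delta(x_t^j=x)\\
		&= \boldsymbol{\mu}_t^N(x)
	\end{align*}
	and the same identity holds for $\boldsymbol{\nu}_t^{i,N}$ and $\boldsymbol{\nu}_t^N$. If the reward is affine as in Assumption \ref{assumption_2}, the population-averaged instantaneous reward therefore becomes
	\begin{align*}
		&\dfrac{1}{N}\sum_{i=1}^N r(x_t^i, u_t^i, \boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N})\\
		&= \boldsymbol{a}^T\boldsymbol{\mu}_t^N + \boldsymbol{b}^T\boldsymbol{\nu}_t^N + \dfrac{1}{N}\sum_{i=1}^N f(x_t^i, u_t^i)
	\end{align*}
	For given states and actions, the right-hand side contains no explicit dependence on $W$. The interaction matrix still influences how the states evolve through the transition function $P$, which is why a separate argument is needed for the state trajectories.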
	
	We now state our main result.
	
	\begin{theorem}
		Let $\boldsymbol{x}_0\triangleq \{x_0^i\}_{i\in [N]}$ be the initial states in an $N$-agent non-uniform MARL problem and $\boldsymbol{\mu}_0$ be its associated empirical distribution defined by $(\ref{mu})$. Assume $\Pi$ to be a set of policies that obeys Assumption \ref{assumption_3}, and let $\boldsymbol{\pi}\triangleq\{\pi_t\}_{t\in\mathbb{T}}$ be a sequence of policies such that $\pi_t\in \Pi$, $\forall t\in \mathbb{T}$. If Assumptions \ref{assumption_1}, \ref{assumption_2} and $\ref{assumption_4}$ hold, then 
		\begin{align}
			\begin{split}
				&|v_{\mathrm{MARL}}(\boldsymbol{x}_0, \boldsymbol{\pi}) - v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \boldsymbol{\pi})| \leq  C_R \dfrac{\sqrt{|\mathcal{U}|}}{\sqrt{N}}\dfrac{1}{1-\gamma}\\
				&+ \dfrac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\dfrac{S_RC_P}{S_P-1} \left[\dfrac{1}{1-\gamma S_P}-\dfrac{1}{1-\gamma}\right]
			\end{split}
		\end{align}
		whenever $\gamma S_P<1$, where $S_P\triangleq (1+L_Q)+L_P(2+L_Q)$, $S_R\triangleq M_R(1+L_Q) + L_R(2+L_Q)$, $C_P\triangleq 2+L_P$, and $C_R\triangleq |\boldsymbol{b}|_1 + M_F$. The parameters $L_P, \boldsymbol{b}, L_Q$ have been defined in Assumptions \ref{assumption_1}, \ref{assumption_2}, \ref{assumption_3}, respectively, and $L_R, M_R$ in Corollary \ref{corollary_1}. The term $M_F$ is such that $|f(x,u)|\leq M_F$, $\forall x\in \mathcal{X}$, $\forall u\in \mathcal{U}$ where $f$ is stated in Assumption \ref{assumption_2}. The functions $v_{\mathrm{MARL}}$ and $v_{\mathrm{MF}}$ are defined in $(\ref{v_MARL}), (\ref{v_MF})$, respectively.
		\label{theorem_1}
	\end{theorem}
	
	Theorem \ref{theorem_1} has an important implication. Specifically, it states that, if the reward and transition functions are respectively affine and Lipschitz continuous functions of the mean-field distributions, and the interaction between the agents is described by a DSM, then, under the same policy sequence, the value computed by MFC is within $\mathcal{O}(1/\sqrt{N})$ of the value of the non-uniform MARL problem. Therefore, the larger the number of agents, the better the MFC-based approximation. The theorem also describes how the approximation error changes with the sizes of the state and action spaces. Specifically, if all other parameters are kept fixed, then the error increases as $\mathcal{O}(\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|})$. In other words, if the individual state and action spaces are large, then MFC may not be a good approximation to non-uniform MARL.
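	
	To make the dependence on the discount factor explicit, the bracketed difference in Theorem \ref{theorem_1} can be simplified by elementary algebra. The following restatement is only a rearrangement of the stated bound (valid for $S_P\neq 1$) and introduces no new assumption.
	\begin{align*}
		\dfrac{1}{S_P-1}\left[\dfrac{1}{1-\gamma S_P}-\dfrac{1}{1-\gamma}\right]
		&=\dfrac{1}{S_P-1}\cdot\dfrac{(1-\gamma)-(1-\gamma S_P)}{(1-\gamma S_P)(1-\gamma)}\\
		&=\dfrac{\gamma}{(1-\gamma S_P)(1-\gamma)}
	\end{align*}
	Hence, the second term of the bound equals $\frac{1}{\sqrt{N}}\big[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\big]\frac{\gamma S_RC_P}{(1-\gamma S_P)(1-\gamma)}$, which vanishes as $\gamma\to 0$ and grows without bound as $\gamma S_P\to 1$.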
	
	We now discuss an example where the transition function, the reward function, and the interaction function satisfy Assumptions \ref{assumption_1}, \ref{assumption_2}, and \ref{assumption_4} respectively.
	
	\begin{example}
		A version of this model has been adapted in \citep{subramanian2019reinforcement} and \citep{chen2021agent}. Consider a network of $N$ firms operated by a single operator. All of the firms produce the same product but with varying quality. A discrete set $\mathcal{X}\triangleq\{1,2,\cdots, Q\}$  (state-space) describes the possible levels of quality of the product. At each time instant, each firm decides whether to invest to improve the quality of its product which leads to the following action set: $\mathcal{U}=\{0, 1\}$. If at time $t$, the $i$-th firm decides to invest, i.e., $u_t^i=1$, its current quality, $x_t^i$, improves according to the following transition law.
		\begin{align*}
			x_{t+1}^i = 
			\begin{cases}
				x_t^i + \left\lfloor \chi\left(1-\dfrac{\bar{\boldsymbol{\mu}}_t^{i,N}}{Q}\right)  (Q-x_t^i)\right\rfloor & \text{if}~ u_t^i = 1, \\
				x_t^i & \text{otherwise}
			\end{cases}
		\end{align*}
		where $\chi$ is a random variable uniformly distributed over $[0,1]$, and $\bar{\boldsymbol{\mu}}_t^{i,N}$ is the average product quality of its $K<N$ neighbouring firms. The intuition is that improving product quality might be difficult if the quality maintained in the local economy is high. Formally, we assume that each firm equally influences and is influenced by $K$ other firms. Hence, $W(i,j)=1/K$ for all $i, j\in [N]$ that influence each other and $W(i,j)=0$ otherwise. The local average product quality is computed as $\bar{\boldsymbol{\mu}}_t^{i,N} \triangleq \sum_{x\in \mathcal{X}} x\boldsymbol{\mu}_t^{i,N}(x)$ where $\boldsymbol{\mu}_t^{i,N}$ is given in $(\ref{mu_i})$. At time $t$, the $i$-th firm earns a positive reward $\alpha_R x_t^i$ due to its revenue, a negative reward $\beta_R\bar{\boldsymbol{\mu}}_t^{i,N}$ due to the average local quality, and incurs a cost $\lambda_R u_t^i$ due to investment. Hence, the total reward can be expressed as follows.
		\begin{align*}
			r(x_t^i, u_t^i,\boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N}) = \alpha_R x_t^i - \beta_R\bar{\boldsymbol{\mu}}_t^{i,N} -  \lambda_R u_t^i
		\end{align*}
		
		Clearly, in this example, Assumptions \ref{assumption_1}, \ref{assumption_2}, and \ref{assumption_4} are satisfied.
		\label{example_1}
	\end{example}
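	
	As a quick check of Assumption \ref{assumption_4} in Example \ref{example_1}, let $\mathcal{N}_i\subset[N]$ denote the set of $K$ firms that influence (and are influenced by) firm $i$. The symbol $\mathcal{N}_i$ is introduced here only for illustration. Since influence is mutual, $j\in\mathcal{N}_i$ if and only if $i\in\mathcal{N}_j$, and therefore both the row and the column sums of $W$ equal one.
	\begin{align*}
		\sum_{j\in[N]} W(i,j) &= \sum_{j\in\mathcal{N}_i}\dfrac{1}{K} = 1, ~~\forall i\in[N],\\
		\sum_{i\in[N]} W(i,j) &= \sum_{i\in\mathcal{N}_j}\dfrac{1}{K} = 1, ~~\forall j\in[N]
	\end{align*}
	Similarly, the reward is affine in the mean-field argument because $\bar{\boldsymbol{\mu}}_t^{i,N}=\sum_{x\in\mathcal{X}}x\boldsymbol{\mu}_t^{i,N}(x)$ is a linear function of $\boldsymbol{\mu}_t^{i,N}$.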
	
	
	\subsection{Proof Outline}
	In this subsection, we shall provide a brief sketch of the proof of Theorem \ref{theorem_1}.
	
	\textit{Step 0}: The difference between $v_{\mathrm{MARL}}$ and $v_{\mathrm{MF}}$ is essentially the time-discounted sum of differences between the average $N$-agent reward and average mean-field (MF) reward at time $t$. Our first goal, therefore, is to estimate the difference between these rewards.
	
	\textit{Step 1}: Average $N$-agent reward at $t$ depends on weighted empirical distributions $\{\boldsymbol{\mu}_t^{i,N}\}_{i\in [N]}$, $\{\boldsymbol{\nu}_t^{i,N}\}_{i\in [N]}$ whereas average MF reward depends on the distributions $\boldsymbol{\mu}_t, \boldsymbol{\nu}_t$. To estimate their difference, we first compute the difference between average $N$-agent reward at $t$ and average MF reward at the same instant generated from the distribution $\boldsymbol{\mu}_t^N$. This estimate is provided by Lemma \ref{lemma_7} in the Appendix. Assumption \ref{assumption_2} is invoked to establish this result.
	
	\textit{Step 2}:  Next we estimate the difference between the average MF reward generated by $\boldsymbol{\mu}_t^N$ and that generated by $\boldsymbol{\mu}_t$. Lemma \ref{lemma_3} in the Appendix bounds this difference by a term proportional to $|\boldsymbol{\mu}_t^N-\boldsymbol{\mu}_t|$.
	
	\textit{Step 3}: Using Lemmas \ref{lemma_2} and \ref{lemma_6}, we establish a recursive relation on $|\boldsymbol{\mu}_t^N-\boldsymbol{\mu}_t|$. By induction, we then bound this difference by an explicit function of $t$.
	
	\textit{Step 4}: Finally, by computing a time-discounted sum of all the upper bounds described above, we arrive at the desired result.
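	
	To illustrate how Steps 3 and 4 combine, consider the following sketch, which relies on a simplifying assumption and is not the exact argument of the Appendix. Let $\Delta_t$ denote an upper bound on $|\boldsymbol{\mu}_t^N-\boldsymbol{\mu}_t|$, and suppose that the recursion of Step 3 takes the linear form $\Delta_{t+1}\leq S_P\Delta_t+c_N$ with $S_P\neq 1$. Here $c_N$ is a placeholder (introduced only for this illustration) for the $\mathcal{O}(1/\sqrt{N})$ per-step term, and $\Delta_0=0$ because $\boldsymbol{\mu}_0$ is the empirical distribution of $\boldsymbol{x}_0$. Unrolling the recursion and taking the discounted sum yields
	\begin{align*}
		\Delta_t&\leq c_N\sum_{k=0}^{t-1}S_P^k = c_N\dfrac{S_P^t-1}{S_P-1},\\
		\sum_{t=0}^{\infty}\gamma^t\Delta_t&\leq \dfrac{c_N}{S_P-1}\left[\dfrac{1}{1-\gamma S_P}-\dfrac{1}{1-\gamma}\right]
	\end{align*}
	where the geometric series $\sum_{t}(\gamma S_P)^t$ converges only if $\gamma S_P<1$. The resulting factor has the same form as the bracketed term in Theorem \ref{theorem_1}, which also explains the condition $\gamma S_P<1$.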
	
	
	
	\section{Solution of MFC via Natural Policy Gradient Algorithm}
	\label{sec:npg}
	
	In this section, we develop a Natural Policy Gradient (NPG) algorithm to solve the MFC problem. By virtue of Theorem \ref{theorem_1}, it provides an approximate solution to the non-uniform MARL problem. Recall from Section \ref{sec:MFC} that, in MFC, it is sufficient to track only one representative agent. At time $t$, that agent chooses its action $u_t$ based on its own state $x_t$ and the mean-field state distribution $\boldsymbol{\mu}_t$. Thus, MFC essentially reduces to a single-agent Markov Decision Problem (MDP) with extended state space $\mathcal{X} \times \mathcal{P}(\mathcal{X})$ and action space $\mathcal{U}$. To solve MFC, it is therefore sufficient to consider only stationary policies \citep{puterman2014markov}. 
	
	Let the set of stationary policies be denoted by $\Pi$ and its elements be parameterized by $\Phi\in \mathbb{R}^{\mathrm{d}}$. For a given policy $\pi_{\Phi}\in \Pi$, we define the corresponding policy sequence as $\boldsymbol{\pi}_{\Phi}\triangleq \{\pi_{\Phi}, \pi_{\Phi}, \cdots\}$. Let $Q_{\Phi}$ be the Q-function associated with policy $\pi_{\Phi}$. We define $Q_{\Phi}(x,\boldsymbol{\mu}, u)$ for arbitrary $x\in\mathcal{X}$, $\boldsymbol{\mu}\in\mathcal{P}(\mathcal{X})$, and $u\in \mathcal{U}$, as follows.
	\begin{align}
		\begin{split}
			&Q_{\Phi}(x, \boldsymbol{\mu}, u) \triangleq\\
			& \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r(x_t, u_t, \boldsymbol{\mu}_t, \boldsymbol{\nu}_t)\Big|x_0=x, \boldsymbol{\mu}_0=\boldsymbol{\mu}, u_0=u\right]
		\end{split}
	\end{align}
	where the expectation is over $u_{t+1}\sim \pi_{\Phi}(x_{t+1}, \boldsymbol{\mu}_{t+1})$, and $x_{t+1}\sim P(x_{t}, u_{t}, \boldsymbol{\mu}_{t}, \boldsymbol{\nu}_{t})$, $\forall t\in \mathbb{T}$. The mean-field distributions $\{\boldsymbol{\mu}_{t+1}, \boldsymbol{\nu}_t\}_{t\in \mathbb{T}}$ are updated via deterministic update equations $(\ref{nu_MF})$, and $(\ref{mu_t_plus_1})$. We now define the advantage function as follows.
	\begin{align}
		A_{\Phi}(x, \boldsymbol{\mu}, u) \triangleq Q_{\Phi}(x, \boldsymbol{\mu}, u) - \mathbb{E} [Q_{\Phi}(x, \boldsymbol{\mu}, \bar{u})]
	\end{align}
	where the expectation is over $\bar{u}\sim \pi_{\Phi}(x, \boldsymbol{\mu})$.
	
	Let $v^*_{\mathrm{MF}}(\boldsymbol{\mu}_0)=\sup_{\Phi\in\mathbb{R}^{\mathrm{d}}}v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \boldsymbol{\pi}_{\Phi})$ where $v_{\mathrm{MF}}$ is the value function of the MFC problem defined in $(\ref{v_MF})$. Let $\{\Phi_j\}_{j=1}^J$ be a sequence of parameters generated by the NPG algorithm \citep{liu2020improved, agarwal2021theory} as follows.
	\begin{equation}
		\label{npg_update}
		\Phi_{j+1} = \Phi_j + \eta \mathbf{w}_j, \quad \mathbf{w}_j\triangleq{\arg\min}_{\mathbf{w}\in\mathbb{R}^{\mathrm{d}}} ~L_{ \zeta_{\boldsymbol{\mu}_0}^{\Phi_j}}(\mathbf{w},\Phi_j)
	\end{equation}
	
	The term $\eta$ is the learning rate. The function $L_{ \zeta_{\boldsymbol{\mu}_0}^{\Phi_j}}$ and the distribution $\zeta_{\boldsymbol{\mu}_0}^{\Phi_j}$ are defined below. 
	\begin{align}
		\begin{split}
			&L_{ \zeta_{\boldsymbol{\mu}_0}^{\Phi'}}(\mathbf{w},\Phi)\triangleq\mathbb{E}_{(x,\boldsymbol{\mu},u)\sim \zeta_{\boldsymbol{\mu}_0}^{\Phi'}}\Big[\Big(A_{\Phi}(x,\boldsymbol{\mu},u)\\
			&
			-(1-\gamma)\mathbf{w}^{\mathrm{T}}\nabla_{\Phi}\log \pi_{\Phi}(x,\boldsymbol{\mu})(u) \Big)^2\Big],
		\end{split}\\
		\label{zeta_dist}
		\begin{split}
			&\zeta_{\boldsymbol{\mu}_0}^{\Phi'}(x,\boldsymbol{\mu},u)\triangleq \sum_{\tau=0}^{\infty}\gamma^{\tau} \mathbb{P}(x_\tau=x,\boldsymbol{\mu}_{\tau}=\boldsymbol{\mu},u_\tau=u|\\
			&x_0\sim\boldsymbol{\mu}_0,\boldsymbol{\pi}_{\Phi'})(1-\gamma)
		\end{split}
	\end{align}
	
	The NPG update (\ref{npg_update}) indicates that, at each iteration, one must solve another minimization problem to obtain the update direction $\mathbf{w}_j$. This sub-problem can be solved by stochastic gradient descent (SGD) with the following update: $\mathbf{w}_{j,l+1}=\mathbf{w}_{j,l}-\alpha\mathbf{h}_{j,l}$ \citep{liu2020improved}. The term $\alpha$ is the learning rate for this sub-problem. The update direction $\mathbf{h}_{j,l}$ is defined as follows.
	\begin{align}
		\begin{split}
			\mathbf{h}_{j,l}&\triangleq \Bigg(\mathbf{w}_{j,l}^{\mathrm{T}}\nabla_{\Phi_j}\log \pi_{\Phi_j}(x,\boldsymbol{\mu})(u)\\
			&-\dfrac{1}{1-\gamma}\hat{A}_{\Phi_j}(x,\boldsymbol{\mu},u)\Bigg)
			\nabla_{\Phi_j}\log \pi_{\Phi_j}(x,\boldsymbol{\mu})(u)
		\end{split}
		\label{sub_prob_grad_update}
	\end{align}
	where $(x,\boldsymbol{\mu},u)\sim\zeta_{\boldsymbol{\mu}_0}^{\Phi_j}$, and $\hat{A}_{\Phi_j}$ is an unbiased estimator of $A_{\Phi_j}$. The process to obtain the samples and the estimator is detailed in Algorithm \ref{algo_2} in Appendix \ref{sampling_process}. We would like to point out that Algorithm \ref{algo_2} is based on Algorithm 3 of \citep{agarwal2021theory}. We summarize the whole NPG process in Algorithm \ref{algo_1}.
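	
	The direction $\mathbf{h}_{j,l}$ in $(\ref{sub_prob_grad_update})$ can be read as a rescaled stochastic gradient of the loss $L_{\zeta_{\boldsymbol{\mu}_0}^{\Phi_j}}(\mathbf{w},\Phi_j)$. As a sketch, write $\mathbf{g}\triangleq\nabla_{\Phi_j}\log\pi_{\Phi_j}(x,\boldsymbol{\mu})(u)$ and $\hat{A}\triangleq\hat{A}_{\Phi_j}(x,\boldsymbol{\mu},u)$ (shorthands introduced here only for brevity), and differentiate the single-sample squared error at $\mathbf{w}=\mathbf{w}_{j,l}$.
	\begin{align*}
		&\nabla_{\mathbf{w}}\left(\hat{A}-(1-\gamma)\mathbf{w}^{\mathrm{T}}\mathbf{g}\right)^2\\
		&=-2(1-\gamma)\left(\hat{A}-(1-\gamma)\mathbf{w}^{\mathrm{T}}\mathbf{g}\right)\mathbf{g}\\
		&=2(1-\gamma)^2\left(\mathbf{w}^{\mathrm{T}}\mathbf{g}-\dfrac{1}{1-\gamma}\hat{A}\right)\mathbf{g}
		=2(1-\gamma)^2\,\mathbf{h}_{j,l}
	\end{align*}
	Thus, $\mathbf{h}_{j,l}$ coincides with the sample gradient up to the positive constant $2(1-\gamma)^2$, which can be absorbed into the learning rate $\alpha$.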
	
	\begin{algorithm}
		\caption{Natural Policy Gradient}
		\label{algo_1}
		\textbf{Input:} $\eta,\alpha$: Learning rates, $J,L$: Number of execution steps\\
		\hspace{1.3cm}$\mathbf{w}_0,\Phi_0$: Initial parameters, $\boldsymbol{\mu}_0$: Initial state distribution\\
		\textbf{Initialization:} $\Phi\gets \Phi_0$ 
		\begin{algorithmic}[1]
			\FOR{$j\in\{0,1,\cdots,J-1\}$}
			{
				\STATE $\mathbf{w}_{j,0}\gets \mathbf{w}_0$\\
				\FOR {$l\in\{0,1,\cdots,L-1\}$}
				{
					\STATE Sample $(x,\boldsymbol{\mu},u)\sim\zeta_{\boldsymbol{\mu}_0}^{\Phi_j}$ and $\hat{A}_{\Phi_j}(x,\boldsymbol{\mu},u)$ using Algorithm \ref{algo_2}\\
					\STATE Compute $\mathbf{h}_{j,l}$ using $(\ref{sub_prob_grad_update})$
					\STATE $\mathbf{w}_{j,l+1}\gets\mathbf{w}_{j,l}-\alpha\mathbf{h}_{j,l}$
				}
				\ENDFOR
				\STATE	$\mathbf{w}_j\gets\dfrac{1}{L}\sum_{l=1}^{L}\mathbf{w}_{j,l}$\\
				\STATE	$\Phi_{j+1}\gets \Phi_j +\eta \mathbf{w}_j$
			}
			\ENDFOR
		\end{algorithmic}
		\textbf{Output:} $\{\Phi_1,\cdots,\Phi_J\}$: Policy parameters
	\end{algorithm}
	
	The global convergence of NPG is stated in Lemma \ref{lemma_0}, which is a direct consequence of Theorem 4.9 of \citep{liu2020improved}. However, the following assumptions are needed to establish the lemma. These are similar to Assumptions 2.1, 4.2, 4.4 respectively in \citep{liu2020improved}.
	
	
	\begin{assumption}
		\label{ass_6}
		$\forall \Phi\in\mathbb{R}^{\mathrm{d}}$, $\forall \boldsymbol{\mu}_0\in\mathcal{P}(\mathcal{X})$, for some $\chi >0$,  $F_{\boldsymbol{\mu}_0}(\Phi)-\chi I_{\mathrm{d}}$ is positive semi-definite  where $F_{\boldsymbol{\mu}_0}(\Phi)$ can be expressed as follows.
		\begin{align*}
			F_{\boldsymbol{\mu}_0}(\Phi)&\triangleq \mathbb{E}_{(x,\boldsymbol{\mu},u)\sim \zeta_{\boldsymbol{\mu}_0}^{\Phi}}\Big[\left\lbrace\nabla_{\Phi}\log\pi_{\Phi}(x,\boldsymbol{\mu})(u)\right\rbrace\\
			&\times\left\lbrace\nabla_{\Phi}\log\pi_{\Phi}(x,\boldsymbol{\mu})(u)\right\rbrace^{\mathrm{T}}\Big]
		\end{align*} 
	\end{assumption}
	
	\begin{assumption}
		\label{ass_7}
		$\forall \Phi\in\mathbb{R}^{\mathrm{d}}$, $\forall \boldsymbol{\mu}\in\mathcal{P}(\mathcal{X})$, $\forall x\in\mathcal{X}$, $\forall u\in\mathcal{U}$, 
		\begin{align*}
			\left|\nabla_{\Phi}\log\pi_{\Phi}(x,\boldsymbol{\mu})(u)\right|_1\leq G
		\end{align*}
		for some positive constant $G$.
	\end{assumption}
	
	\begin{assumption}
		\label{ass_8}
		$\forall \Phi_1,\Phi_2\in\mathbb{R}^{\mathrm{d}}$, $\forall \boldsymbol{\mu}\in\mathcal{P}(\mathcal{X})$,  $\forall x\in\mathcal{X}$, $\forall u\in\mathcal{U}$,
		\begin{align*}
			|\nabla_{\Phi_1}\log\pi_{\Phi_1}(x,\boldsymbol{\mu})(u)&-\nabla_{\Phi_2}\log\pi_{\Phi_2}(x,\boldsymbol{\mu})(u)|_1\\
			&\leq M|\Phi_1-\Phi_2|_1
		\end{align*}
		
		for some positive constant $M$.
	\end{assumption}
	
	
	
	
	\begin{assumption}
		\label{ass_9}
		$\forall \Phi\in\mathbb{R}^{\mathrm{d}}$, $\forall \boldsymbol{\mu}_0\in\mathcal{P}(\mathcal{X})$, 
		\begin{align*}
			L_{\zeta_{\boldsymbol{\mu}_0}^{\Phi^*}}(\mathbf{w}^{*}_{\Phi},\Phi)\leq \epsilon_{\mathrm{bias}}, ~~\mathbf{w}^*_{\Phi}\triangleq{\arg\min}_{\mathbf{w}\in\mathbb{R}^{\mathrm{d}}} L_{\zeta_{\boldsymbol{\mu}_0}^{\Phi}}(\mathbf{w},\Phi) 
		\end{align*}
		where $\Phi^*$ is the parameter of the optimal policy.
	\end{assumption}
	
	\begin{lemma}
		\label{lemma_0}
		Let $\{\Phi_j\}_{j=1}^J$ be the sequence of policy parameters obtained from Algorithm \ref{algo_1}. If Assumptions \ref{ass_6}$-$\ref{ass_9} hold, then, for any $\epsilon>0$, the following inequality holds for suitable choices of $\eta, \alpha, J,L$, 
		\begin{align*}
			v_{\mathrm{MF}}^*(\boldsymbol{\mu}_0)-\dfrac{1}{J}\sum_{j=1}^J v_{\mathrm{MF}}({\boldsymbol{\mu}_0},\pi_{\Phi_j}) \leq \dfrac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}+\epsilon,
		\end{align*}  
		for arbitrary  initial parameter $\Phi_0$ and initial state distribution $\boldsymbol{\mu}_0\in\mathcal{P}(\mathcal{X})$. The parameter $\epsilon_{\mathrm{bias}}$ is a constant. The sample complexity of Algorithm \ref{algo_1} is $\mathcal{O}(\epsilon^{-3})$.  
	\end{lemma}
	
	The bias $\epsilon_{\mathrm{bias}}$ turns out to be small for rich neural network based policies \citep{liu2020improved}. Intuitively, it indicates the expressive power of the policy class, $\Pi$.
	
	Lemma \ref{lemma_0} establishes that Algorithm \ref{algo_1} can approximate the optimal mean-field value function up to an error of $\sqrt{\epsilon_{\mathrm{bias}}}/(1-\gamma)+\epsilon$ with a sample complexity of $\mathcal{O}(\epsilon^{-3})$. Combining this with Theorem \ref{theorem_1}, we obtain the following result.
	\begin{theorem}
		\label{corr_1}
		Let $\boldsymbol{x}_0\triangleq\{x_0^i\}_{i\in[N]}$ be the initial states in an $N$-agent system and $\boldsymbol{\mu}_0$ their associated empirical distribution. Assume that $\{\Phi_j\}_{j=1}^J$ are the policy parameters generated from Algorithm \ref{algo_1}, and the set of policies, $\Pi$ satisfies Assumption \ref{assumption_3}. If Assumptions \ref{assumption_1}, \ref{assumption_2}, \ref{assumption_4}, \ref{ass_6} - \ref{ass_9} are satisfied, then, for any $\epsilon>0$, the following inequality holds for certain choices of $\eta, \alpha,J,L$  
		\begin{align}
			\label{eq_thm4}
			\begin{split}
				&	\left|\sup_{\Phi\in\mathbb{R}^{\mathrm{d}}}v_{\mathrm{MARL}}(\boldsymbol{x}_0,\pi_{\Phi})-\dfrac{1}{J}\sum_{j=1}^J v_{\mathrm{MF}}({\boldsymbol{\mu}_0},\pi_{\Phi_j})\right|\\
				&\hspace{3cm}\leq \dfrac{\sqrt{\epsilon_{\mathrm{bias}}}}{1-\gamma}+C \max\{e,\epsilon\}\\
				\text{where}& ~e\triangleq \dfrac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]
			\end{split}
		\end{align}  
		whenever $\gamma S_P<1$ where $S_P$ is given in Theorem \ref{theorem_1}. The term, $C$ is a constant and the parameter $\epsilon_{\mathrm{bias}}$ is defined in Lemma \ref{lemma_0}. The sample complexity of the process is $\mathcal{O}(\epsilon^{-3})$.
	\end{theorem}
	\begin{proof} By the triangle inequality,
		\begin{align*}
			\begin{split}
				&	\left|\sup_{\Phi\in\mathbb{R}^{\mathrm{d}}}v_{\mathrm{MARL}}(\boldsymbol{x}_0,\pi_{\Phi})-\dfrac{1}{J}\sum_{j=1}^J v_{\mathrm{MF}}({\boldsymbol{\mu}_0},\pi_{\Phi_j})\right|\\
				& \leq 	\left|\sup_{\Phi\in\mathbb{R}^{\mathrm{d}}}v_{\mathrm{MARL}}(\boldsymbol{x}_0,\pi_{\Phi}) - v_{\mathrm{MF}}^*(\boldsymbol{\mu}_0)\right| \\
				& + \left|v_{\mathrm{MF}}^*(\boldsymbol{\mu}_0)-\dfrac{1}{J}\sum_{j=1}^J v_{\mathrm{MF}}({\boldsymbol{\mu}_0},\pi_{\Phi_j})\right|
			\end{split}
		\end{align*}
		

	
	
		Since the difference of two suprema is at most the supremum of the differences, Theorem \ref{theorem_1} bounds the first term by $C'e$ for some constant $C'$. The second term is bounded by $\sqrt{\epsilon_{\mathrm{bias}}}/(1-\gamma) + \epsilon$ with a sample complexity of $\mathcal{O}(\epsilon^{-3})$ (Lemma \ref{lemma_0}). As $C'e+\epsilon\leq 2\max\{C',1\}\max\{e,\epsilon\}$, choosing $C=2\max\{C', 1\}$ yields the result. 
	\end{proof}
	
	Theorem \ref{corr_1} guarantees that Algorithm \ref{algo_1} can yield a policy whose associated value is within $\mathcal{O}(\max\{e,\epsilon\})$ of the optimal value of the non-uniform MARL problem, up to the bias term $\sqrt{\epsilon_{\mathrm{bias}}}/(1-\gamma)$. Moreover, such a policy can be obtained with a sample complexity of $\mathcal{O}(\epsilon^{-3})$.
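	
	As a rough illustration of Theorem \ref{corr_1}, the mean-field term $e$ does not dominate $\max\{e,\epsilon\}$ once $e\leq \epsilon$, i.e., once
	\begin{align*}
		N\geq \dfrac{\left(\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right)^2}{\epsilon^2}.
	\end{align*}
	For instance, with $|\mathcal{X}|=10$ and $|\mathcal{U}|=2$ we have $(\sqrt{10}+\sqrt{2})^2=12+4\sqrt{5}\approx 20.9$, so a target of $\epsilon=0.1$ (a value chosen here purely for illustration) corresponds to $N$ of roughly $2.1\times 10^3$. This back-of-the-envelope calculation ignores the constant $C$ and the bias term.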
	
	\section{Experiments}
	




	Let the policy sequence that maximizes the mean-field value function $v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \cdot)$ be denoted as $\boldsymbol{\pi}^*_{\mathrm{MF}}$ where $\boldsymbol{\mu}_0$ indicates the empirical distribution of the initial joint state, $\boldsymbol{x}_0^N$. We define the percentage error as follows.
\begin{align}
\mathrm{error}\triangleq \left|\dfrac{v_{\mathrm{MARL}}(\boldsymbol{x}_0^N, \boldsymbol{\pi}^*_{\mathrm{MF}})-v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \boldsymbol{\pi}_{\mathrm{MF}}^*)}{v_{\mathrm{MF}}(\boldsymbol{\mu}_0, \boldsymbol{\pi}_{\mathrm{MF}}^*)}\right|\times 100\%
\label{def_error}
\end{align}


			\begin{figure}
	\centering
	\includegraphics[width=\linewidth]{Fig1.png}
	\caption{Percentage $\mathrm{error}$ (defined by $(\ref{def_error})$) as a function of $N$. The reward, state transition, and agent interaction matrix are the same as stated in Example \ref{example_1}. The bold line and the half-width of the shaded region respectively denote the mean and the standard deviation of the $\mathrm{error}$ obtained over $25$ random seeds. The values of the system parameters used in the experiment are as follows: $K=5$, $\alpha_R=1$, $\beta_R=\lambda_R=0.5$, and $Q=10$. The hyperparameter values used in Algorithm \ref{algo_1} are as follows: $\alpha = \eta = 10^{-3}$, $J=L=10^2$. We use a feedforward neural network with a single hidden layer as the policy approximator.}
	\label{fig_1}
\end{figure}

\begin{figure*}
	\centering
	\begin{subfigure}{0.45\textwidth}
		\centering
		\includegraphics[width=\linewidth]{Fig2a.png}
		\caption{$\sigma=1.1$}
		\label{fig_2a}
	\end{subfigure}
	\begin{subfigure}{0.45\textwidth}
		\centering
		\includegraphics[width=\linewidth]{Fig2b.png}
		\caption{$\sigma=1.2$}
		\label{fig_2b}
	\end{subfigure}
	\caption{Percentage $\mathrm{error}$ when the reward function is given by $(\ref{new_reward})$. All other parameters are the same as in Fig. \ref{fig_1}.}
	\label{fig_2}
\end{figure*}

We can approximately obtain $\boldsymbol{\pi}^*_{\mathrm{MF}}$ using Algorithm \ref{algo_1}. Fig. \ref{fig_1} plots the value of $\mathrm{error}$ (defined in $(\ref{def_error})$) as a function of $N$ for the reward, transition function, and interaction model described in Example \ref{example_1}. The parameter values used in this numerical experiment are listed in the caption of Fig. \ref{fig_1}. Evidently, the $\mathrm{error}$ decreases with $N$. Notice that the reward function stated in Example \ref{example_1} (and hence the one used to generate Fig. \ref{fig_1}) is linear in its mean-field distribution argument. In Fig. \ref{fig_2}, we plot the $\mathrm{error}$ as a function of $N$ for the following non-linear reward function.
\begin{align}
r(x_t^i, u_t^i,\boldsymbol{\mu}_t^{i,N}, \boldsymbol{\nu}_t^{i,N}) = \alpha_R x_t^i - \beta_R(\bar{\boldsymbol{\mu}}_t^{i,N})^\sigma -  \lambda_R u_t^i
\label{new_reward}
\end{align}

The term $\sigma$ is a measure of non-linearity. All other parameters are the same as stated in Example \ref{example_1}. Observe that if $\sigma=1$, the reward function stated above is identical to the reward function given in Example \ref{example_1}. In Fig. \ref{fig_2a} and \ref{fig_2b}, we plot the $\mathrm{error}$ for $\sigma=1.1$ and $\sigma=1.2$ respectively. In both scenarios, the $\mathrm{error}$ is a decreasing function of $N$. This indicates that, although our MFC-based approximation results are theoretically proven for affine rewards only, they empirically hold for non-affine rewards as well.

The code for generating these results is publicly available at: https://github.com/washim-uddin-mondal/UAI2022




	
	\section{Conclusion}
	
	In this article, we consider a multi-agent reinforcement learning (MARL) problem where the interaction between agents is described by a doubly stochastic matrix. We prove that, if the reward function is affine, one can well-approximate this non-uniform MARL problem via an associated Mean-Field Control (MFC) problem. We obtain an upper bound of the approximation error as a function of the number of agents, and also propose a natural policy gradient (NPG) algorithm to solve the MFC problem with polynomial sample complexity. The obvious drawback of our approach is the restriction on the structure of the reward function. Therefore, extension of our techniques to non-affine reward functions is an important future goal. 
	
%	\begin{contributions} % will be removed in pdf for initial submission, so you can already fill it to test with the ‘accepted’ class option
%	\end{contributions}
	
	\begin{acknowledgements} % will be removed in pdf for initial submission, so you can already fill it to test with the ‘accepted’ class option
	W. U. M. and S. V. U. were partially funded by NSF Grant No. 1638311 CRISP Type 2/Collaborative Research: Critical Transitions in the Resilience and Recovery of Interdependent Social and Physical Networks.
	\end{acknowledgements}
	
	\bibliography{mondal_582}
\end{document}
