\section{Reproducibility Statement}
Our source code is available at \url{https://sites.google.com/view/cole-2023/}.

Further reproducibility information can be found in the appendix:
\begin{itemize}
\vspace{-3mm}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{1pt}
\setlength{\parsep}{1pt}
    \item The detailed pseudocode of the Graphic Shapley Value Solver is provided in Appendix~\ref{appedix:solver}.
    \item Appendix~\ref{appedix:layouts} describes the experimental environment, the Overcooked game.
    \item The implementation details and hyperparameters used in the experiments are given in Appendix~\ref{appendix:cole}.
\end{itemize}



\section{Proof of Theorem~\ref{thm: converge}}
\label{appendix:proofs_thm}
\FirstTHM*
\begin{proof}
   
   
    
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   

   
   
   
   
   
   
   
   
   
   
   
   
   
    
   

   

According to the definition of the local best-preferred strategy, the local optimal strategy is the node with zero preference centrality ($\eta$). Therefore, we need to prove that the value of $\eta$ converges to zero.

Let $\eta_t$ denote the preference centrality of the updated strategy $s_t$ in generation $t$, where $0\leq \eta_t \leq 1$. We assume that the algorithm makes some improvement in step $t$. With the assumption that not all optimization steps fail to improve, we can deduce that
\begin{equation}
    \eta_t = \eta_{t-1} - \epsilon,
\end{equation}
where $\epsilon$ is a positive value and $0< \epsilon \leq \eta_{t-1}$. Writing $\epsilon = \alpha_{t-1}\eta_{t-1}$, we have
\begin{equation}
\begin{aligned}
    \eta_t &= \eta_{t-1} - \epsilon,\\
    &=\eta_{t-1} -\alpha_{t-1} \eta_{t-1},\\
    &=\beta_{t-1} \eta_{t-1},
\end{aligned}
\end{equation}
where $0< \alpha_{t-1} \leq 1$ and $\beta_{t-1} = 1- \alpha_{t-1}$.

Assuming that the preference centrality in the initial step satisfies $0\leq \eta_0 \leq 1$, we can unroll the recursion:
\begin{equation}
\begin{aligned}
    \eta_t &=\beta_{t-1} \eta_{t-1},\\
    &=\beta_{t-1} \beta_{t-2} \eta_{t-2},\\
    &=\cdots, \\
    &=\prod_{i=0}^{t-1} \beta_i \times \eta_{0}.
\end{aligned}
\label{eq:iter_eta}
\end{equation}
For any $\beta \in \{\beta_0, \cdots, \beta_{t-1}\}$, we have $0 \leq \beta < 1$. In addition, we set $\eta_t$ to a very small positive number if $\eta_t=0$.

Therefore, by assuming that $\beta$ is not always equal to zero, we can conclude that $\eta_t$ approaches zero, as outlined in~\eqref{eq:iter_eta}. This establishes that if the assumption $\lim_{i\rightarrow \infty}{\mathcal{J}}(s_i)\geq {\mathcal{J}}(s_{i-1})$ holds, the sequence $\{s_i\}_{i\in \mathbb{N}}$ converges to a local optimal strategy $s^*$, also known as the local best-preferred strategy.

\end{proof}
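
As an illustration of the recursion in~\eqref{eq:iter_eta}, consider a hypothetical run in which every generation removes the same fraction of the remaining preference centrality, say $\alpha_i = 1/2$ for all $i$ (this constant is an assumption made only for the example; the proof allows $\alpha_i$ to vary). Then $\beta_i = 1/2$ and
\begin{equation*}
\eta_t = \prod_{i=0}^{t-1}\beta_i \times \eta_0 = \frac{\eta_0}{2^{t}}, \qquad \text{e.g. } \eta_0 = 0.8 \;\Rightarrow\; \eta_1 = 0.4,\ \eta_2 = 0.2,\ \eta_3 = 0.1 .
\end{equation*}
More generally, whenever the factors admit a uniform bound $\beta_i \leq \bar{\beta} < 1$ (again an illustrative assumption), we obtain $\eta_t \leq \bar{\beta}^{\,t}\eta_0 \to 0$ as $t\to\infty$.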

\section{Proof of Corollary~\ref{lemma: converge_rate}}
\label{appendix:proofs_corollary}
\FirstLEMMA*
\begin{proof}
   
   
   
   
   
   
   
    In Theorem~\ref{thm: converge}, we have proved that the strategies generated by the {COLE} framework\xspace~will converge to the local best-preferred strategy.
When we use the in-degree centrality function as $\eta$, the preference centrality function can be rewritten as:
\begin{equation}
        \eta(i) = 1-\frac{I_i}{n-1},
    \end{equation}
where $I_i$ is the in-degree of node $i$ and $n$ is the size of the strategy set ${\mathcal{N}}$.
    Therefore, we have
    \begin{equation}
\begin{aligned}
\label{eq:proof_1}
\lim\limits_{t \to \infty} \frac{|\eta_{t+1} - 0|}{|\eta_{t} - 0|} 
        = &\lim\limits_{t \to \infty} \frac{\eta_{t+1}}{\eta_{t}} \\
        =& \lim\limits_{t \to \infty} \frac{1-\frac{k_{t+1}}{t}}{1-\frac{k_{t}}{t-1}} \\
        = &\lim\limits_{t \to \infty} \frac{t-1}{t} \frac{t - k_{t+1}}{t-k_t-1} \\
        =&\lim\limits_{t \to \infty} \frac{t - k_{t+1}}{t-k_t-1} \\
        =&1
\end{aligned}
\end{equation}
   
   
   
   
   
   
   
   
   
   
   

Therefore, using the in-degree centrality, we can conclude that the {COLE} framework\xspace~will converge to the local optimal strategy at a Q-sublinear rate.
\end{proof}
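
To see the rate concretely, one may read $k_t$ in~\eqref{eq:proof_1} as the in-degree of the strategy produced at generation $t$ in a population of size $t$ (this is one reading of the notation; the symbol is not defined explicitly above). Suppose, purely for illustration, that a fixed number $c \geq 1$ of strategies never prefer the newest one, so that $k_t = t-1-c$. Then
\begin{equation*}
\eta_t = 1-\frac{k_t}{t-1} = \frac{c}{t-1}, \qquad \frac{\eta_{t+1}}{\eta_t} = \frac{c/t}{c/(t-1)} = \frac{t-1}{t} \longrightarrow 1 \quad (t\to\infty).
\end{equation*}
In this example $\eta_t \to 0$, but the ratio of successive errors tends to $1$ rather than to a constant below $1$, which is the defining property of a Q-sublinear rate.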







\section{Graphic Shapley Value Solver Algorithm}
\label{appedix:solver}
Algorithm~\ref{algo:solver} gives the detailed steps of the graphic Shapley value solver in Section~\ref{sec:solver}.

\begin{algorithm}[ht]
\caption{Graphic Shapley Value Solver Algorithm}
\label{algo:solver}
\begin{algorithmic}[1]
\STATE \algorithmicrequire: population ${\mathcal{N}}$, the number $k$ of Monte Carlo permutation samples, the size of the negative population

\STATE Initialize $\phi = \mb{0}_{|{\mathcal{C}}|}$
\FOR{$(1,2,\cdots, k)$}
    \STATE $\pi \longleftarrow \textit{Uniformly sample from } \Pi_{\mathcal{C}}$, where $\Pi_{\mathcal{C}}$ is permutation set
    \FOR{$i\in \mc{N}$}
    \COMMENT{Obtain predecessors of player $i$ in sampled permutation $\pi$}
    \STATE $S_\pi(i) \longleftarrow \{j\in \mc{N} | \pi(j)<\pi(i)\}$
    \COMMENT{Update incompatibility weights}
    \STATE $\phi_i\longleftarrow {\phi}_i + \frac{1}{k}\bigl(v(S_{\pi}(i)\cup \{i\})-v(S_{\pi}(i))\bigr)$
    \ENDFOR
    \ENDFOR
\STATE $\phi \longleftarrow \phi/\sum\phi$
\STATE $\phi \longleftarrow (1-\phi)/\sum(1-\phi)$
\STATE \algorithmicensure: $\phi$
\end{algorithmic}
\end{algorithm}
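
A small worked instance may help fix ideas. Take a hypothetical two-strategy population $\{1,2\}$ with characteristic values $v(\emptyset)=0$, $v(\{1\})=1$, $v(\{2\})=2$, and $v(\{1,2\})=4$ (these numbers are invented for illustration), and enumerate both permutations exactly once instead of sampling, so that $k=2$:
\begin{equation*}
\begin{aligned}
\pi=(1,2):\quad & v(\{1\})-v(\emptyset)=1 \text{ for } i=1, \quad v(\{1,2\})-v(\{1\})=3 \text{ for } i=2,\\
\pi=(2,1):\quad & v(\{2\})-v(\emptyset)=2 \text{ for } i=2, \quad v(\{1,2\})-v(\{2\})=2 \text{ for } i=1,\\
\phi = {} & \bigl(\tfrac{1}{2}(1+2),\ \tfrac{1}{2}(3+2)\bigr) = (1.5,\ 2.5).
\end{aligned}
\end{equation*}
The two final steps of Algorithm~\ref{algo:solver} then give $\phi/\sum\phi = (0.375,\ 0.625)$ and $(1-\phi)/\sum(1-\phi) = (0.625,\ 0.375)$, so the strategy with the smaller Shapley value receives the larger final weight.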


\section{Overcooked Environment}
\label{appedix:layouts}
In this paper, we conduct a series of experiments in the Overcooked environment~\citep{HARL,charakorn2020investigating,knott2021evaluating}, which was proposed as a coordination challenge, to verify the performance of \algo. 
Overcooked is a two-player common-payoff game in which each player controls one chef in a kitchen to cook and serve soup; serving a soup yields a reward of 20 for the team.

We test our code on five layouts: \textbf{Cramped Room}, \textbf{Asymmetric Advantages}, \textbf{Coordination Ring}, \textbf{Forced Coordination}, and \textbf{Counter Circuit}. Screenshots of these layouts are shown in Fig.~\ref{fig:overcooked-layouts}.

\begin{figure}[ht] \centering    
\subfigure[{\scriptsize Cramped Room}] {\includegraphics[height=65pt]{figures/overcooked/simple.jpg}
}   
\subfigure[{\scriptsize Asymmetric Advantages}] {   
    \includegraphics[height=65pt]{figures/overcooked/unident_s.jpg}  
}   
\subfigure[{\scriptsize Coordination Ring}] {   
    \includegraphics[height=65pt]{figures/overcooked/random1.jpg}  
}   
\subfigure[{\scriptsize Forced Coordination}] {   
    \includegraphics[height=65pt]{figures/overcooked/random0.jpg}  
}   
\subfigure[{\scriptsize Counter Circuit}] {   
    \includegraphics[height=65pt]{figures/overcooked/random3.jpg}  
}   

\caption{ Overcooked environment layouts.}     
\label{fig:overcooked-layouts}     
\end{figure}







The five layouts are described in detail below.
\begin{enumerate}[label=(\alph*)]
\vspace{-3mm}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{1pt}
\setlength{\parsep}{1pt}
   

   
   
   
   

   
    
   
   
   
   
   
   
   
   
    \item \textbf{Cramped Room}. The cramped room is a simple environment where two players are limited to a small room with only one pot (black box with gray bottom) and one serving spot (light gray square). Therefore, players are expected to fully utilize the pot and effectively deliver soup, even with basic coordination.

\item \textbf{Asymmetric Advantages}. In this layout, two players are placed in two disconnected kitchens. As the name suggests, the positions of onions, pots, and serving spots are asymmetric. In the left kitchen, onions are far from the pots, while serving spots are near the middle area of the layout. However, in the right kitchen, onions are placed near the middle area and the serving areas are far from the pots.

\item \textbf{Coordination Ring}. This ring-like layout requires both players to keep moving to prevent blocking each other, especially in the top-right and bottom-left corners where the onions and pots are located. For optimal cooperation, both pots should be utilized.

\item \textbf{Forced Coordination}. The Forced Coordination is another layout that separates the two agents. There are no pots or serving spots on the left side, nor are there onions or pots on the right side. Therefore, two players must coordinate with each other to complete the task. The left player is expected to prepare onions and plates while the right player cooks and serves them.

\item \textbf{Counter Circuit}. The Counter Circuit is another ring-like layout but larger in map size. In this layout, pots, onions, plates, and serving spots are placed in four different directions. Limited by the narrow aisles, players are easily blocked. Therefore, coordinating and performing the task is difficult in this environment. Players need to learn the advanced technique of putting onions in the middle area to pass them to the other quickly, which can further improve performance.        
\end{enumerate}

\section{Experimental Details of \algo}
\label{appendix:cole}
This paper uses Proximal Policy Optimization (PPO)~\citep{PPO} as the oracle algorithm for our strategy set ${\mathcal{N}}$, which consists of strategies parameterized by convolutional neural networks. Each network is composed of 3 convolutional layers with 25 filters and 3 fully connected layers with 64 hidden neurons. To manage computational resources, we cap the population at 50 strategies. When the population exceeds this limit, we randomly remove one of the earliest 10 strategies.

We run and evaluate all our experiments on Linux servers with two types of nodes: 1) a 1-GPU node with an NVIDIA GeForce 3090Ti 24G GPU and an AMD EPYC 7H12 64-Core Processor, and 2) a 2-GPU node with GeForce RTX 3090 24G GPUs and an AMD Ryzen Threadripper 3970X 32-Core Processor.
In the Overcooked environment, training \algo on one layout takes about one to two days on the 2-GPU node.

The hyperparameter setup is similar to that of PBT and MEP and is given as follows.
\begin{itemize}
    \item The learning rates for the five layouts are 2e-3, 1e-3, 6e-4, 8e-4, and 8e-4, respectively.
    \item The discount factor $\gamma$ is 0.99.
    \item The GAE parameter $\lambda$ is 0.98.
    \item The PPO clipping factor is 0.05.
    \item The value function (VF) coefficient is 0.5.
    \item The maximum gradient norm is 0.1.
    \item Each PPO update uses 48000 training time steps, divided into 10 mini-batches.
    \item The total numbers of generations for each layout are 80, 60, 75, 70, and 70, respectively.
    \item For each generation, we update 10 times to approximate the best-preferred strategy.
    \item The $\alpha$ is 1.
\end{itemize}
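
As a rough budget check under one reading of the list above (each of the 10 updates per generation is a PPO update over 48000 time steps; this reading is an assumption, not something the list states outright), the per-mini-batch and per-generation sizes are
\begin{equation*}
\frac{48000}{10} = 4800 \ \text{steps per mini-batch}, \qquad 10 \times 48000 = 4.8\times 10^{5} \ \text{steps per generation},
\end{equation*}
so that generation counts of 80, 60, 75, and 70 would correspond to roughly $3.84\times 10^{7}$, $2.88\times 10^{7}$, $3.6\times 10^{7}$, and $3.36\times 10^{7}$ time steps in total.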

\section{Implementations of Baselines}
\label{appendix:base}
This section describes the implementations of the baselines.
We train and evaluate self-play and PBT using the Human-Aware Reinforcement Learning repository~\citep{HARL}\footnote{\url{https://github.com/HumanCompatibleAI/human_aware_rl/tree/neurips2019}.} with Proximal Policy Optimization (PPO)~\citep{PPO} as the RL algorithm.
We implement FCP according to the FCP paper~\citep{FCP}, also with PPO as the RL algorithm.
This implementation is likewise based on the Human-Aware Reinforcement Learning repository (the same one used for self-play and PBT).
The MEP agent is trained with a population size of 5, following the MEP paper~\citep{MEP}, using the original implementation\footnote{The code of MEP original implementation: \url{https://github.com/ruizhaogit/maximum_entropy_population_based_training}.}.



\section{Trajectory Visualization}\label{Trajectory}
We visualize the trajectories produced by \algo 1:3 and 0:4 with middle-level and expert partners in Overcooked at \url{https://sites.google.com/view/cole-2023/}.
Fig.~\ref{fig:case_study} presents three screenshots of the \algo 0:4 model (blue player) that collaborates with one of the expert partners, the PBT model (green player). 
The case illustrates the importance of the individual objective in zero-shot coordination with expert partners. 
Frame A is a screenshot taken at 53s when the two players start to impede each other. 
The PBT model has taken the plate and wants to load and serve the dish. 
The blue player wants to take the plate but does not know how to change the objective to allow the green player to load the dish. 
After blocking for about 11s, the blue player starts to move and lets the green player go to the pots (Frame B). 
However, the process is not smooth and takes 7s to reach Frame C. 
This phenomenon does not occur in \algo 1:3 coordination with expert partners, which shows that including individual objectives might improve the cooperative ability with expert partners.

\begin{figure}[h]
    \centering
\includegraphics[width=0.75\linewidth]{figures/case_study.pdf}
\caption{
{Trajectory snapshots of the \algo 0:4 model (blue) with one of the expert partners, the PBT model (green).}
}
    \label{fig:case_study}
\end{figure}


\section{Limitations and Future Work}
\label{appendix:limits}

Convergence of the {COLE} framework\xspace requires the assumption that the preference centrality of the newly generated strategy ranks in the top $k$, where $k$ is an additional hyperparameter.
If $k$ is too large, admitting lower-ranked strategies, the {COLE} framework\xspace converges slowly to the best-preferred strategy. 
On the other hand, if $k$ is too small, demanding a higher ranking, the assumption cannot be guaranteed in every generation, which easily causes learning to fail.
Besides, in our implemented algorithm~\algo, we introduce the Shapley Value as a tool and develop the Graphic Shapley Value to analyze cooperative ability.
Although we use Monte Carlo permutation sampling to reduce the computational complexity, it remains high.
Therefore, given limited computational resources, we maintain a population of only 50 strategies.


Future work will focus on developing an adaptive mechanism that automatically selects a suitable value for the hyperparameter $k$ to improve the convergence rate without increasing the number of iterations per update. 
Meanwhile, improving the efficiency of the graphic Shapley Value solver and exploring other solvers for evaluating cooperative ability are important for developing the framework.
Future work also includes developing practical algorithms for more complex games beyond Overcooked.
\section{Introduction}


Recently, multi-agent reinforcement learning (MARL) has achieved huge success in zero-sum games, especially competitive video games.
However, those zero-sum games always exhibit non-transitivity in the policy space, so the optimization objective becomes unclear and learning cycles through strategies.
Recent works~\citep{psrorn,pipelinepsro,YingWen2021openenned, mcaleer2022anytime,mcaleer2022self} focused on finding (approximate) Nash equilibria based on the Double Oracle (DO) algorithm~\citep{doubleoracle} and developed the Policy Space Response Oracles (PSRO) algorithm~\citep{psro}.
Besides, non-transitivity is believed to be entwined with behavioral diversity, both in the human world and in biological systems~\citep{bio_diversity1,bio_diversity2}.
Therefore, some research~\citep{yaodong_diverse,YingWen2021openenned} has studied behavioral diversity to solve non-transitivity problems.

\section{Related Work}

\section{Preliminary}

\section{Evaluation Intransitivity}

\subsection{What is intransitivity?}
Intransitivity is believed to be widespread in real-world games~\citep{czarnecki2020real} and is well defined in competitive games, especially zero-sum games.
Intransitivity in competitive games usually refers to strategic cycles in the strategy space, as in Rock-Paper-Scissors.
Therefore, conventional algorithms such as self-play encounter obstacles in such intransitive games, i.e., the objective is no longer clear and overall agent strength does not improve~\citep{psrorn}.
However, ``what is intransitivity in cooperative games?'' remains an open question in the area of cooperative games.
``Strategic cycles'' is no longer a clear concept in cooperative games, especially in common-payoff games.

Although it is hard to define intransitivity in a cooperative strategy space, we can focus on intransitivity in the learning process over sequentially generated policies and then remove the intransitive obstacles in learning.
Leaving aside the inherent concept of a ``cycle'', the essence of intransitivity in this setting is that the latest policy is beaten by prior policies.
We can give an analogous concept for cooperative games: intransitivity means that the latest agent is not the first preference for cooperation.
Intuitively, intransitivity means that the overall improvement in strength has not been maintained.



\begin{figure}[hptb]
    \centering
    \includegraphics[width=0.5\linewidth]{figures/tmp/example_graph.png}
    \caption{Example of game graph.}
    \label{fig:eg_gg}
\end{figure}

\subsection{How to determine if intransitivity exists?}

In this section, we analyze intransitivity in two-player general-sum games from a graph perspective.
We first define a special directed weighted graph, named the game graph, to describe the relations and outcomes within a sequentially generated set of policies $\mc{N}=\{1,2,\cdots,n\}$.
Let $\mb{G}=\{\mc{N}, \mb{V}\}$, where $\mc{N}$ is the node set, i.e., the generated policies, and $\mb{V}$ denotes the edge set.
For nodes $i$ and $j$, the edge from $i$ to $j$ is $(i,j)$ with weight $\mb{w}(i,j) = (\mathit{w}_i (i,j), \mathit{w}_j (i,j))$, where $\mathit{w}_{(\cdot)} (i,j)$ is the outcome for node $\cdot\in\{i,j\}$.
For simplicity, an edge $(i,j)$ in $\mb{G}$ only goes from a newer policy to a prior policy, i.e., $i>j$.
The formal definition of the game graph is given as follows.
\begin{definition}[Game Graph]
    A game graph $\mb{G}$ is a directed weighted graph defined by $\{\mc{N}, \mb{V}, \mb{w}\}$, where $\mc{N}=\{1,2,\cdots,n\}$ is the node set, i.e., the policy population, ordered by a principle such as the order of generation; $\mb{V}$ denotes the edge set; and $\mb{w} : \mb{V} \longrightarrow \mathbb{R}^2$ is the weight function mapping each directed edge to an outcome vector with two elements, the outcomes for the head node and the tail node, respectively. 
    
    Intuitively, any edge $(i,j)\in \mb{V}$ represents the game-play relation between policy $i$ and policy $j$, and $\mb{w}(i,j) = (\mathit{w}_i (i,j), \mathit{w}_j (i,j))$ gives the mean rewards achieved by policies $i$ and $j$, respectively.
\end{definition}


Figure~\ref{fig:eg_gg} shows the relations and outcomes within three generated policies $\mc{N}=\{1,2,3\}$.
For example, the tuple $(e,f)$ beside the directed edge from $3$ to $1$ is its weight $\mb{w}$, where the first element $e$ is the outcome achieved by the head node $3$ and the second element $f$ is the outcome achieved by the tail node $1$.

To better analyze each node in $\mb{G}$, we then define the sub game graph.
\begin{definition}[sub game graph]
    Given a game graph $\mb{G} = \{\mc{N}, \mb{V}, \mb{w}\}$, we consider the sub game graph $\mb{G}_{i} = \{\mc{N}_i, \mb{V}_i, \mb{w}_i\}$, where $i\in \mc{N}$ is the latest node. 
    We denote by $\mc{N}_i$ the set consisting of $i$ and all its predecessors in $\mc{N}$, i.e., $\mc{N}_i = \{j \in \mc{N} \mid j \leq i\}$. 
    $\mb{V}_i$ and $\mb{w}_i$ are the edge set and weight function restricted to the nodes $\mc{N}_i$.
\end{definition}

\begin{figure}[hptb]
    \centering
    \subfigure[Game Graph of population with 10 checkpoints during training phase.]{
        \includegraphics[width=0.48\linewidth]{figures/tmp/G_adj.png}
    }
    \subfigure[Preference graph of the population with 10 checkpoints during training phase.]{
	\includegraphics[width=0.48\linewidth]{figures/tmp/pgg.png}
    }
    \caption{Example of preference graph.}
    \label{fig:eg_pg}
\end{figure}


Furthermore, we define the preference graph $\mb{PG}=\{\mc{N},\mb{V}_{pg}\}$, which is an ordinary directed unweighted graph.
An edge $(i,j)$ in $\mb{PG}$ indicates that the head node $i$ achieves its maximum outcome when playing with $j$, compared with playing with any other node.
It expresses a preference: $i$ prefers to play with $j$ because doing so gives $i$ the highest reward.
We then use the adjacency matrix $\mc{M}_{pg}$ to describe the preference graph, where the edge from $i$ to $j$ exists if $\mc{M}_{pg}(i,j)$, the outcome of $i$, is greater than 0.

\begin{definition}[Preference Graph]
    Given a game graph $\mb{G}=\{\mc{N},\mb{V}\}$, the preference graph $\mb{PG}$ of $\mb{G}$ is a directed unweighted graph defined by $\{\mc{N},\mb{V}_{pg}\}$. 
    Each edge $(i,j)$ in $\mb{V}_{pg}$ represents the preference of the head node $i$, which prefers to play with the tail node $j$ because this yields the highest outcome compared with other nodes. 
    The preference graph can be described by an adjacency matrix $\mc{M}_{pg}$,
    where $\mc{M}_{pg}(i,j) = \mc{M}(i,j)$ if the edge from $i$ to $j$ exists, and $\mc{M}_{pg}(i,j) = 0$ otherwise.
\end{definition}

Figure~\ref{fig:eg_pg} gives two examples, a general-sum game and a zero-sum game.
They are typical transitive games, where the latest policy beats all others in competitive games or is the first cooperative choice of all others in cooperative games.


To better describe the graph, we introduce degree centrality from complex network analysis.
\begin{definition}[Degree Centrality]
    Given a preference graph $\mb{PG}=\{\mc{N},\mb{V}_{pg}\}$, the degree centrality of a node $u\in \mc{N}$ is defined as
    $$
    \operatorname{DC}(u)=\frac{k_u}{|\mc{N}|-1},
    $$
    where $|\mc{N}|$ is the size of the node set and $k_u$ is the degree of node $u$.
    Specifically, we denote by $\operatorname{DC}_{i}(u)$ the indegree centrality and by $\operatorname{DC}_{o}(u)$ the outdegree centrality.
\end{definition}

Therefore, we can state an existence theorem for intransitivity in two-player general-sum games.

\begin{theorem}
Consider two-player general-sum games and construct the game graph $\mb{G}$ over a population of policies $\mc{N}=\{1,2,\cdots,n\}$.
For any sub game graph $\mb{G}_i$ of $\mb{G}$ and the corresponding preference graph $\mb{PG}_i$, if the indegree (in a cooperative game) or the outdegree (in a competitive game) of the latest node $i$ in $\mb{PG}_i$ is less than $n-1$, we say that the game contains intransitivity.
\end{theorem}

\begin{corollary}
Consider two-player general-sum games and construct the game graph $\mb{G}$ over a population of policies $\mc{N}=\{1,2,\cdots,n\}$.
For any sub game graph $\mb{G}_i$ of $\mb{G}$ and the corresponding preference graph $\mb{PG}_i$, if the degree centrality of the latest node $i$ in $\mb{PG}_i$ is less than 1, we say that the game contains intransitivity.
Specifically, the indegree is used for cooperative games and the outdegree for competitive games.
\end{corollary}
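
As a small illustration of this test, take a hypothetical cooperative population with $n=4$ and let $i=4$ be the latest node, so that $\mb{PG}_4$ has node set $\{1,2,3,4\}$ (the edge sets below are invented for the example). If only nodes $1$ and $2$ have an edge pointing to node $4$, then
\begin{equation*}
\operatorname{DC}_{i}(4)=\frac{k_4}{|\mc{N}|-1}=\frac{2}{3}<1,
\end{equation*}
and the corollary flags intransitivity: node $3$ prefers some partner other than the latest policy. If instead nodes $1$, $2$, and $3$ all point to node $4$, the indegree centrality equals $3/3=1$ and the criterion is not triggered at this node.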

\subsection{How to evaluate the intransitivity?}

\subsection{Weighted Page-Rank}
The more preferred a strategy is, the more links it receives from other strategies that tend to play with it.
Therefore, we introduce the weighted PageRank (WPG) from complex network analysis.
The formula of WPG is given as follows:
\begin{equation}
    \operatorname{WPG}(u)=(1-d) + d \sum_{v\in B(u)} \operatorname{WPG}(v) W_{(v,u)}^{in} W_{(v,u)}^{out},
\end{equation}
where $d$ is the damping factor, set to $0.85$, and $B(u)$ is the set of nodes that point to $u$. 
$W_{(v,u)}^{in}$ and $W_{(v,u)}^{out}$ are the popularity weights computed from the numbers of inlinks and outlinks, respectively:

$$
W_{(v,u)}^{in}  = \frac{I_u}{\sum_{p\in R(v)} I_p},
$$

$$
W_{(v,u)}^{out}= \frac{O_u}{\sum_{p\in R(v)}O_p},
$$

where $R(v)$ denotes the set of nodes that $v$ links to, and $I$ and $O$ are the indegree and outdegree of a node, respectively.
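
For instance, suppose (hypothetically) that node $v$ links to exactly two nodes, $R(v)=\{u,p\}$, with indegrees $I_u=3$, $I_p=1$ and outdegrees $O_u=1$, $O_p=3$. Then
\begin{equation*}
W_{(v,u)}^{in}=\frac{3}{3+1}=\frac{3}{4},\qquad W_{(v,u)}^{out}=\frac{1}{1+3}=\frac{1}{4},
\end{equation*}
so $v$ contributes $d\,\operatorname{WPG}(v)\cdot\tfrac{3}{4}\cdot\tfrac{1}{4}=\tfrac{3}{16}\,d\,\operatorname{WPG}(v)$ to $\operatorname{WPG}(u)$.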

\section{Method}
\subsection{Shapley Value Algorithm}
We first define a characteristic function game ${\mathcal{G}}$ by $({\mathcal{N}},v)$, where ${\mathcal{N}}=\{1,\dots,n\}$ is a finite, non-empty population of strategies that is a subset of the strategy space ${\mathcal{S}}$, and $v: 2^{\mathcal{N}} \longrightarrow \mathbb{R}$ is a characteristic function that maps each coalition ${\mathcal{C}} \subseteq {\mathcal{N}}$ to a real number $v({\mathcal{C}})$. 
Specifically, given a coalition ${\mathcal{C}} \subseteq {\mathcal{N}}$, the characteristic function is defined as follows.

\begin{equation}
    v({\mathcal{C}}) = u(\sigma, \sigma) - \min_{s_i^\prime \in {\mathcal{S}}}u(s_i^\prime, \sigma)
    \label{eq:cv}
\end{equation}

where the mixed strategy $\sigma$ is the uniform probability distribution over the coalition ${\mathcal{C}}$, and the strategy $s_i^\prime$ denotes the worst cooperator given the partner strategy $\sigma$.
Besides, we define the value of the null coalition to be 0.
Intuitively, the characteristic value of a coalition reflects how much better the mixed strategy performs with itself than with the worst cooperating strategy.
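
As a hypothetical numerical instance of~\eqref{eq:cv} (all payoff values below are invented), let ${\mathcal{C}}=\{1,2\}$ with $u(1,1)=4$, $u(1,2)=u(2,1)=2$, and $u(2,2)=8$, and suppose the worst partner in ${\mathcal{S}}$ attains $\min_{s_i^\prime \in {\mathcal{S}}}u(s_i^\prime,\sigma)=1$. With $\sigma$ uniform over ${\mathcal{C}}$, and assuming $u$ is extended to mixed strategies by taking expectations,
\begin{equation*}
u(\sigma,\sigma)=\tfrac{1}{4}\bigl(4+2+2+8\bigr)=4,\qquad v({\mathcal{C}})=4-1=3 .
\end{equation*}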

We next prove that the characteristic function~\eqref{eq:cv} satisfies the following three properties.

\begin{itemize}
    \item Null: $v(\emptyset) = 0$,
    \item Superadditivity: $v(S \cup T) \geq v(S) + v(T), \ \forall S, T \in 2^{\mathcal{N}} \text{ with } S\cap T = \emptyset$,
    \item Monotonicity: $\forall S, T \in 2^{\mathcal{N}}, v(S) \leq  v(T) \text{ if } S \subseteq T$.
\end{itemize}

\begin{proof}
Here are proofs.

\paragraph{Null.} It is obvious that $v(\emptyset) = 0$ according to the definition.

\paragraph{Superadditivity.}  Given any coalition $S = \{s_1,s_2,\cdots, s_l\},T= \{t_1,t_2,\cdots, t_k\} \in 2^{\mathcal{N}}, S\cap T=\emptyset$, we have 
$$
v{S\cup T} = \frac{1}{l+k} (\sum_{i+1}^{l}
$$

\paragraph{Monotonicity.} Given any coalition $T \in \mc{C}$ and $S\subseteq T$, .
\end{proof}











Therefore, we give the detailed procedure in Algorithm~\ref{alg:cooperative} for identical-interest normalized games.

\SetKwComment{Comment}{\color{gray}// }{ }
\RestyleAlgo{ruled,lined}
\begin{algorithm}[ht]
\caption{Meta Solver for Common Payoff Games}\label{alg:cooperative}
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
\SetKwFunction{FCV}{$v$}
\SetKwProg{Pn}{Function}{:}{\KwRet}

\Input{population $\mc{N}$ with $n$ strategies, a set of permutations $\Pi_\mc{N}$, an ego strategy $p_e$, the number $k$ of Monte Carlo permutation sampling iterations, payoff matrix ${\mathcal{M}}$ of shape $n\times (n+1)$, temperature constant $t$, randomly initialized policy $b$}

\Output{ego strategy $p_e$ and population ${\mathcal{N}}$}

Complete ${\mathcal{M}}$ by playing within ${\mathcal{N}}$ and $b$.

\While{Not Converged}{

\Comment{Calculate Shapley value with MC permutation sampling}
$\phi_i \longleftarrow 0, \forall i\in \mc{N}$

\For{$(1,2,\cdots, k)$}{
    $\pi \longleftarrow \textit{Uniform Sample from } \Pi_\mc{N}$
    
    \For{$i\in \mc{N}$}{
    $S_\pi(i) \longleftarrow \{j\in \mc{N} | \pi(j)<\pi(i)\}$
    
    ${\phi}_i\longleftarrow {\phi}_i + \frac{v(S_{\pi}(i)\cup \{i\}, {\mathcal{M}}, b)-v(S_{\pi}(i), {\mathcal{M}}, b)}{k}$
    }
}

\Comment{Normalize Shapley value to probability}

$\hat{\phi} = \frac{\exp(\phi)}{\sum\exp(\phi)}$

\Comment{Best cooperation over population}
\For{$(1,2,\cdots)$}{

$p\sim \hat{\phi}$

$p_e\longleftarrow$ Generate data and Update parameters of $p_e$ by \textit{Play}$(p_e, p)$
}

Delete the strategy with the lowest Shapley value and add $p_e$ to ${\mathcal{N}}$

Update ${\mathcal{M}}$ by playing $p_e$ with the others in ${\mathcal{N}}$ and with $b$

}

\Pn{\FCV{${\mathcal{C}}, {\mathcal{M}}, b$}}{
    $v = 0$
    
    \For{$i=1,2,\cdots,m$}{
    Uniformly sample $p_1$ and teammate strategy $p_{-1}$ from ${\mathcal{C}}$ respectively
    
    $v = v + \frac{1}{m}({\mathcal{M}}(p_1,p_{-1}) /t - {\mathcal{M}}(b,p_{-1}) / t)$
    }
    
    return $v$
}
\end{algorithm}

\clearpage
\section{Evaluation results on Self-play}

\begin{figure}[htb]
    \centering
    \subfigure[Game Graph adj matrix]{
        \includegraphics[width=0.48\linewidth]{figures/analysis/SP/GGM.png}
    }
    \subfigure[Preference graph adj matrix of last population]{
	\includegraphics[width=0.48\linewidth]{figures/analysis/SP/PGM.png}
    }
    
    
    \subfigure[Preference graph adj matrix of]{
	\includegraphics[width=0.48\linewidth]{figures/analysis/SP/PGM.png}
    }
\end{figure}


\begin{figure}[htb]
    \centering
    \subfigure[WPG]{
	\includegraphics[width=0.6\linewidth]{figures/sp/WPG_sp_ray.png}
    }
    
    \caption{The evaluation metrics on 100 checkpoints of self-play learning on Simple Overcooked!2.}
    \label{fig.IV}
\end{figure}

\clearpage
\section{Evaluation results on PBT}

\begin{figure}[htb]
    \centering
    \subfigure[Game Graph adj matrix]{
        \includegraphics[width=0.48\linewidth]{figures/pbt/G_adj_pbt.png}
    }
    \subfigure[Preference graph adj matrix]{
	\includegraphics[width=0.48\linewidth]{figures/pbt/PGG_adj_pbt.png}
    }
\end{figure}


\begin{figure}[htb]
    \centering
    \subfigure[WPG of agent1]{
	\includegraphics[width=0.45\linewidth]{figures/pbt/WPG_pbt_ray_0.png}
    }
    \subfigure[WPG of agent2]{
	\includegraphics[width=0.45\linewidth]{figures/pbt/WPG_pbt_ray_1.png}
    }
    
    \subfigure[WPG of agent3]{
	\includegraphics[width=0.45\linewidth]{figures/pbt/WPG_pbt_ray_2.png}}
    \subfigure[WPG of agent4]{
	\includegraphics[width=0.45\linewidth]{figures/pbt/WPG_pbt_ray_3.png}
    }
    \caption{The evaluation metrics on checkpoints of PBT learning on Simple Overcooked!2.}
    \label{fig.IV}
\end{figure}

\clearpage
\section{Evaluation results on Ours}

\begin{figure}[htb]
    \centering
    \subfigure[Game Graph adj matrix]{
        \includegraphics[width=0.48\linewidth]{figures/our/G_adj_our.png}
    }
    \subfigure[Preference graph adj matrix]{
	\includegraphics[width=0.48\linewidth]{figures/our/PGG_adj_our.png}
    }
\end{figure}


\begin{figure}[htb]
    \centering
    \subfigure[WPG]{
	\includegraphics[width=0.6\linewidth]{figures/our/WPG_our.png}
    }
    
    \caption{The evaluation metrics on checkpoints of Ours learning on Simple Overcooked!2.}
    \label{fig.IV}
\end{figure}

\clearpage
\section{Discussion and TODO}

\subsection{Discussion}
Now, we define the characteristic function of the game as:
$$v(C) = \sum_{i\in C}\sum_{j\in C} r(i,j).$$

We can then calculate the Shapley value by
$$
 \phi_i(G)=\frac{1}{n!}\sum_{\pi\in\Pi_\mathcal{N}} \Delta_\pi^G(i),
$$

where $\Delta_\pi^G(i)=v(S_\pi(i)\cup \{i\}) - v(S_\pi(i))$ is the marginal contribution of $i$.

If we use the sum of each pair's rewards as the value function, we can simplify the marginal contribution:

$$\Delta_\pi^G(i)=\sum_{j\in S_\pi(i)} R(i,j).$$

So the marginal contribution is the sum of the rewards of policy $i$ with the other policies in the coalition.

Besides, under this value function, the Shapley value of a larger coalition must be higher than that of smaller coalitions.
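
The simplification can be checked by expanding the double sum directly. Splitting the pairs in $S_\pi(i)\cup\{i\}$ into those that do and do not involve $i$ gives
\begin{equation*}
\begin{aligned}
v(S_\pi(i)\cup\{i\}) &= \sum_{a\in S_\pi(i)}\sum_{b\in S_\pi(i)} r(a,b) + \sum_{j\in S_\pi(i)}\bigl(r(i,j)+r(j,i)\bigr) + r(i,i),\\
\Delta_\pi^G(i) &= \sum_{j\in S_\pi(i)}\bigl(r(i,j)+r(j,i)\bigr) + r(i,i).
\end{aligned}
\end{equation*}
This agrees with the compact form above under the assumption, which the text does not state, that $R(i,j)$ stands for the symmetrized pair reward $r(i,j)+r(j,i)$ and that the self-play term $r(i,i)$ is either zero or excluded from the sum.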




\textcolor{red}{How to define the value function?}

\subsubsection{TODO}
In this paper, I plan to organize the content as follows.

\begin{itemize}
    \item What is intransitivity in cooperative games? We use the game graph to unify and clarify the concept.
    \item Analyze the intransitivity of existing methods such as SP, PBT, and FCP in the Overcooked!2 environment and in one zero-sum game (maybe AlphaZero on Connect Four).
    \item We propose Shapley value PBT and prove that it overcomes intransitivity in cooperative games. 
    \begin{itemize}
        \item higher reward
        \item a game graph with less intransitivity, i.e., the indegree is as high as possible.
        \item evaluation on our intransitivity metric with the Shapley value.
    \end{itemize}
\end{itemize}

The core contribution is that we present the game graph to revisit and analyze intransitivity in both cooperative and competitive games, and we then propose a Shapley value PBT to solve the problem.

\textcolor{red}{1. network improvement from: centrality; PageRank; complex graph 2. regret: how to get the optimal policy \\ 
first part: evaluation\\
second part: training}

TODO : complex network like centrality ; regret paper; reward normlized;

\clearpage
\section{Preliminaries}
\subsection{Advanced Probability Overview}
The following definitions follow the CMU 36-753 Advanced Probability Overview.

\subsubsection{$\sigma$-fields/algebras}
\begin{definition}[fields and $\sigma$-fields]
Let $\Omega$ be a set. A collection $\mathcal{F}$ of subsets of $\Omega$ is called a field if it satisfies
\begin{itemize}
    \item $\Omega \in \mc{F}$,
    \item for each $A\in \mc{F}$, $A^C\in \mc{F}$,
    \item for all $A_1,A_2 \in \mc{F}, A_1 \cup A_2 \in \mc{F}$.
\end{itemize}
A field $\mc{F}$ is a $\sigma$-field if, in addition, it satisfies: for every sequence $\{A_k\}_{k=1}^\infty$ of elements of $\mc{F}$, $\bigcup_{k=1}^\infty A_k\in \mc{F}$.
\end{definition}
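
As a small illustration of the definition above (our own example, not taken from the course notes), let $\Omega=\{1,2,3\}$ and
$$
\mathcal{F}=\{\emptyset,\{1\},\{2,3\},\Omega\}.
$$
This collection contains $\Omega$, is closed under complements ($\{1\}^C=\{2,3\}$ and $\emptyset^C=\Omega$), and is closed under pairwise unions ($\{1\}\cup\{2,3\}=\Omega$), so it is a field. Because it is finite, every countable union reduces to a finite one, so it is also a $\sigma$-field.
By contrast, the collection of subsets of $\mathbb{N}$ that are either finite or have a finite complement is a field but not a $\sigma$-field: each singleton $\{2k\}$ belongs to it, yet the countable union $\bigcup_{k=1}^\infty\{2k\}$ of the even numbers is neither finite nor co-finite.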

Measures on fields and $\sigma$-fields are defined as follows.
\begin{definition}[Measurable Space]
    A set $\Omega$ together with a $\sigma$-field $\mc{F}$ is called a measurable space $(\Omega, \mc{F})$, and the elements of $\mc{F}$ are called measurable sets.
\end{definition}

\begin{definition}
    Let $(\Omega, \mc{F})$ be a measurable space. Let $\mu : \mc{F} \longrightarrow \mathbb{\rm I\!R}^{+0}$ satisfy
    \begin{itemize}
        \item $\mu(\emptyset)=0$,
        \item for every sequence $\{A_k\}_{k=1}^\infty$ of mutually disjoint elements of $\mc{F}$, $\mu(\cup_{k=1}^\infty A_k)=\sum_{k=1}^\infty \mu(A_k)$,
    \end{itemize}
    where $\mathbb{\rm I\!R}^{+0}$ denotes the nonnegative extended reals $[0,\infty]$.
    Then $\mu$ is called a measure on $(\Omega, \mc{F})$ and $(\Omega, \mc{F}, \mu)$ is a measure space. If $\mc{F}$ is merely a field, then a $\mu$ that satisfies the above two conditions whenever $\cup_{k=1}^\infty A_k \in \mc{F}$ is called a measure on the field $\mc{F}$.
\end{definition}

\begin{definition}[$\sigma$-finite measure]
Let $(\Omega, \mc{F}, \mu)$ be a measure space, and let $\mc{C}\subseteq \mc{F}$. Suppose that there exists a sequence $\{A_n\}_{n=1}^\infty$ of elements of $\mc{C}$ such that $\mu(A_n)<\infty$ for all $n$ and $\Omega=\bigcup_{n=1}^\infty A_n$. Then we say that $\mu$ is $\sigma$-finite on $\mc{C}$. If $\mu$ is $\sigma$-finite on $\mc{F}$, we merely say that $\mu$ is $\sigma$-finite.
\end{definition}

\begin{definition}[Product $\sigma$-Field.]
Let $(\Omega_1,\mc{F}_1)$ and $(\Omega_2,\mc{F}_2)$ be measurable spaces. Let $\mc{F}_1 \otimes \mc{F}_2$ be the smallest $\sigma$-field of subsets of $\Omega_1 \times \Omega_2$ containing all sets of the form $A_1 \times A_2$, where $A_i\in \mc{F}_i$ for $i=1,2$. Then $\mc{F}_1 \otimes \mc{F}_2$ is the product $\sigma$-field. 
\end{definition}

\begin{theorem}[Product Measure.]
Let $(\Omega_i,\mc{F}_i, \mu_i)$ for $i=1,2$ be $\sigma$-finite measure spaces. There exists a unique measure $\mu$ defined on $(\Omega_1 \times \Omega_2, \mc{F}_1 \otimes \mc{F}_2)$ that satisfies $\mu(A_1\times A_2)=\mu_1(A_1)\mu_2(A_2)$ for all $A_1\in \mc{F}_1$ and $A_2\in \mc{F}_2$.
\end{theorem}

\subsection{Normal Form Game Decomposition~\citep{Sung2016strategic}}

Consider a collection of measurable spaces $S_i$, each equipped with a $\sigma$-algebra and a measure $m_i$, for $i=1,\cdots,n$, and let $S:=\prod_{i=1}^n S_i$ denote the space of strategy profiles.
We can then define the following vector space of games:
\begin{equation}
    \mathcal{L} := \{f:S\longrightarrow \mathbb{R}^n \text{ measurable and }\|f\| < \infty\}.
\end{equation}

\begin{definition}
    We define the following subspaces of $\mathcal{L}$:
    \begin{itemize}
        \item The space of \textit{identical interest games}, $\mathcal{I}$, is defined by
        $$\mathcal{I}:=\{f\in\mathcal{L}:f^{(i)}(s)=f^{(j)}(s)~ \text{for all}~ i,j ~\text{and for all}~ s\}.$$
        
        \item The space of zero-sum games, $\mc{Z}$, is defined by
        $$
        \mc{Z} := \{f\in \mc{L}:\sum_{l=1}^n f^{(l)}(s)=0 \text{\ for all } s\}.
        $$
        
        \item The space of normalized games, $\mc{N}$, is defined by 
        $$
        \mc{N} := \{f\in \mc{L}: \int f^{(i)}(t_i,s_{-i}) dm_i(t_i)=0\text{ for all } s_{-i},\text{ for all } i\}.
        $$ 
        note: A normalized game is a game in which the sum of one player’s payoffs, given the other players’ strategies, is always zero.
        
        \item The space of nonstrategic games, $\mc{E}$, is defined by
        $$
        \mc{E} := \{f\in \mc{L} : f^{(i)}(s_i,s_{-i})=f^{(i)}(s_i^{\prime},s_{-i}) \text{ for all } s_i,s_i^\prime,s_{-i}, \text{ for all } i\}.
        $$
        note: A non-strategic game (also sometimes called a passive game) is a game in which each player’s payoff does not depend on her own strategy choice. Thus, each player’s strategy choice plays no role in determining her payoff. Because of this property, the players’ strategic relations remain unchanged if we add the payoff of a non-strategic game to that of another game.
    \end{itemize}
\end{definition}

\begin{definition}
    We say that game $g$ is strategically equivalent to game $f$ if
    $$
    g=f+h \text{ for some } h\in \mc{E}.
    $$
    We write this relation as $g\sim f$.
    
    note: in two strategically equivalent games, strategic variables such as best responses of players are the same.
\end{definition}
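
To make strategic equivalence concrete, consider the following sketch (the payoffs are hypothetical and chosen only for illustration). Take two players, each with two strategies, and write player 1's payoff as a matrix whose rows are indexed by $s_1$ and whose columns are indexed by $s_2$:
$$
f^{(1)}=\begin{pmatrix}2&0\\0&1\end{pmatrix},\quad
h^{(1)}=\begin{pmatrix}5&-1\\5&-1\end{pmatrix},\quad
g^{(1)}=f^{(1)}+h^{(1)}=\begin{pmatrix}7&-1\\5&0\end{pmatrix}.
$$
The entries of $h^{(1)}$ are constant down each column, so $h^{(1)}$ does not depend on player 1's own strategy, as a non-strategic payoff requires. Against the first column, player 1's best response is the first row under both $f^{(1)}$ ($2>0$) and $g^{(1)}$ ($7>5$); against the second column, it is the second row under both ($1>0$ and $0>-1$). Adding $h^{(1)}$ therefore leaves player 1's best responses unchanged, and the same argument applies to player 2 with a payoff $h^{(2)}$ that is constant along each row.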

Before giving the new game decompositions, we first define the sum of two subspaces:
$$
A+A^\prime := \{f+f^\prime : f\in A, f^\prime \in A^\prime \}.
$$

\begin{definition}
    We have the following definitions:
    \begin{itemize}
        \item The space of potential games (identical interest equivalent games) is defined by
        $$
        \mc{I} + \mc{E}.
        $$
        
        \item The space of zero-sum equivalent games is defined by
        $$
        \mc{Z} + \mc{E}.
        $$
        
        \item The space of games that are strategically equivalent to both an identical interest game and a zero-sum game, called zero-sum equivalent potential games, is denoted by $\mc{B}$:
        $$
        \mc{B} = (\mc{I} + \mc{E}) \cap (\mc{Z} + \mc{E}).
        $$
    \end{itemize}
\end{definition}

In the context of game theory, the following two decompositions receive much attention in the literature: (i) identical interest games versus zero-sum games~\citep{kalai2010cooperation} and (ii) normalized games versus non-strategic games~\citep{hwang2011decompositions}:
\begin{equation}
    \text{(i) } \mc{L} = \mc{I} \oplus \mc{Z},\quad \text{(ii) } \mc{L} = \mc{N} \oplus \mc{E},
\end{equation}
where $\oplus$ denotes the direct sum, i.e., every element of $\mc{L}$ can be written uniquely as the sum of one element from each of the two subspaces.
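
For two players, the decomposition into identical interest and zero-sum parts can be written out explicitly. The following is a sketch with hypothetical payoff values. Any game $f=(f^{(1)},f^{(2)})$ splits as
$$
f = \underbrace{\Bigl(\tfrac{f^{(1)}+f^{(2)}}{2},\ \tfrac{f^{(1)}+f^{(2)}}{2}\Bigr)}_{\in\,\mathcal{I}}
  + \underbrace{\Bigl(\tfrac{f^{(1)}-f^{(2)}}{2},\ \tfrac{f^{(2)}-f^{(1)}}{2}\Bigr)}_{\in\,\mathcal{Z}}.
$$
For example, at a strategy profile where the payoffs are $(3,1)$, the identical interest part is $(2,2)$ and the zero-sum part is $(1,-1)$. The split is unique because a game that is both identical interest and zero-sum must satisfy $f^{(1)}=f^{(2)}$ and $f^{(1)}+f^{(2)}=0$, hence $f=0$.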

\subsection{Characteristic function game}

\begin{definition}
    A characteristic function game $G$ is given by a pair $(\mathcal{N}, v)$, where $\mathcal{N} = \{1,\cdots,n\}$ is a finite, non-empty set of agents and $v:2^\mathcal{N}\longrightarrow \mathbb{R}$ is a characteristic function, which maps each coalition $C\subseteq \mathcal{N}$ to a real number $v(C)$. The number $v(C)$ is usually referred to as the value of the coalition $C$.
\end{definition}

If a characteristic function game $G$ has the property that the coalitional value $v(C)$ can be divided amongst the members of $C$ in any way that the members of $C$ choose, the game $G$ is called a \textbf{transferable utility game (TU game)}.

A convex game is a typical transferable utility game in cooperative game theory. 
\begin{definition}
    \label{def:cg}
    A characteristic function $v$ is said to be supermodular if it satisfies
    $$
    v(C\cup D) + v(C\cap D) \geq v(C) + v(D)
    $$
    for every pair of coalitions $C,D\subseteq \mathcal{N}$. 
    A game with a supermodular characteristic function is said to be convex.
\end{definition}

\begin{proposition}
    \label{prop:cg}
A characteristic function game $G=(\mathcal{N},v)$ is convex if and only if for every pair of coalitions $T,S$ such that $T\subset S$ and every player $i\in \mathcal{N} \backslash S$ it holds that
$$
v(S\cup \{i\}) - v(S) \geq v(T\cup \{i\}) - v(T).
$$
\end{proposition}

In other words, in a convex game, a player is more useful when she joins a bigger coalition.
The characteristic function $v$ of a convex game satisfies two properties (necessary conditions for convexity):
\begin{itemize}
 \item $v(\mathcal{C}\cup \mathcal{D}) \geq v(\mathcal{C}) + v(\mathcal{D}), \forall \mathcal{C},\mathcal{D}\subset \mathcal{N}, \mathcal{C}\cap \mathcal{D}=\emptyset$ (superadditivity, which follows from supermodularity when $v(\emptyset)=0$);
\item the coalitions are independent.
\end{itemize}
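
As a quick check of these properties, consider the following toy game (our own illustrative choice, not taken from the cited literature): $\mathcal{N}=\{1,2,3\}$ and $v(C)=|C|^2$. The marginal contribution of a player $i$ joining a coalition $S$ is
$$
v(S\cup\{i\})-v(S)=(|S|+1)^2-|S|^2=2|S|+1,
$$
which grows with $|S|$, so the condition of Proposition~\ref{prop:cg} holds and the game is convex. For instance, with $T=\{1\}$, $S=\{1,2\}$, and $i=3$, we get $v(S\cup\{3\})-v(S)=9-4=5\geq v(T\cup\{3\})-v(T)=4-1=3$. Superadditivity also holds here: $v(\{1\}\cup\{2,3\})=9\geq v(\{1\})+v(\{2,3\})=1+4=5$.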


An outcome of $G$ is a pair $(CS,\mathbf{x})$, where $CS$ is a coalition structure over $G$ and $\mathbf{x}$ is a payoff vector for $CS$.
\begin{definition}[Coalition Structure]
    Given a characteristic function game $G=(\mathcal{N}, v)$, a coalition structure over $\mathcal{N}$ is a collection of non-empty subsets $CS=\{C^1,\cdots,C^k\}$ such that
    \begin{itemize}
        \item $\bigcup_{j=1}^k C^j = \mathcal{N}$,
        \item $C^i\cap C^j=\emptyset$ for any $i,j\in \{1,\cdots,k\}$ such that $i\neq j$.
    \end{itemize}
\end{definition}

\begin{definition}[Payoff Vector]
    A vector $\mathbf{x}=(x_1,\cdots,x_n)\in \mathbb{R}^n$ is a payoff vector for a coalition structure $CS=\{C^1,\cdots,C^k\}$ over $\mathcal{N}=\{1,\cdots,n\}$ if
    \begin{itemize}
        \item $x_i\geq 0$ for all $i\in \mathcal{N}$,
        \item $\sum_{i\in C^j} x_i\leq v(C^j)$ for any $j\in \{1,\cdots,k\}$.
    \end{itemize}
\end{definition}

Given a payoff vector $\textbf{x}$, $x(C)$ is the total payoff $\sum_{i\in C}x_i$ of a coalition $C\subseteq \mathcal{N}$ under $\textbf{x}$.

Furthermore, the Shapley value is an important solution concept in cooperative games. To formally define the Shapley value, we need some additional notation. 
Given a characteristic function game $G=(\mathcal{N},v)$, let $\Pi_\mathcal{N}$ denote the set of all permutations of $\mathcal{N}$, i.e., one-to-one mappings from $\mathcal{N}$ to itself.
Given a permutation $\pi \in \Pi_\mathcal{N}$, we denote by $S_\pi(i)$ the set of all predecessors of $i$ in $\pi$, i.e., we set $S_\pi(i)=\{j\in\mathcal{N}| \pi(j)< \pi(i)\}$.

The marginal contribution of an agent $i$ with respect to a permutation $\pi$ in a game $G=(\mathcal{N}, v)$ is denoted by $\Delta_\pi^G(i)=v(S_\pi(i)\cup \{i\}) - v(S_\pi(i))$.
\begin{definition}
    Given a characteristic function game $G=(\mathcal{N}, v)$ with $|\mathcal{N}|=n$, the Shapley value of a player $i\in \mathcal{N}$ is defined as 
    $$
    \phi_i(G)=\frac{1}{n!}\sum_{\pi\in\Pi_\mathcal{N}} \Delta_\pi^G(i).
    $$
\end{definition}
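
The definition can be traced by hand on a small game. The following sketch uses a hypothetical three-player game that is not part of our experiments: $\mathcal{N}=\{1,2,3\}$, with $v(C)=1$ if $C$ contains player 1 and at least one other player, and $v(C)=0$ otherwise. Listing each permutation as the order in which the players join, the marginal contributions are

\begin{center}
\begin{tabular}{c|ccc}
order of arrival & $\Delta_\pi^G(1)$ & $\Delta_\pi^G(2)$ & $\Delta_\pi^G(3)$ \\ \hline
$(1,2,3)$ & 0 & 1 & 0 \\
$(1,3,2)$ & 0 & 0 & 1 \\
$(2,1,3)$ & 1 & 0 & 0 \\
$(2,3,1)$ & 1 & 0 & 0 \\
$(3,1,2)$ & 1 & 0 & 0 \\
$(3,2,1)$ & 1 & 0 & 0 \\
\end{tabular}
\end{center}

Averaging each column over the $3!=6$ permutations gives $\phi_1(G)=4/6=2/3$ and $\phi_2(G)=\phi_3(G)=1/6$. The three values sum to $1=v(\mathcal{N})$.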

The Shapley value possesses four desirable properties:
\begin{itemize}
    \item Efficiency: $\sum_{i\in \mathcal{N}}\phi_i(G)=v(\mathcal{N})$, i.e., it distributes the value of the grand coalition among all agents;
    \item Dummy player: if $v(C)=v(C\cup \{i\})$ for any $C \subseteq \mathcal{N}$, then $i$ is called a dummy player and $\phi_i(G)=0$;
    \item Symmetry: if $v(C\cup \{j\})=v(C\cup \{i\})$ for any $C \subseteq \mathcal{N}\backslash \{i,j\}$, then players $i$ and $j$ are symmetric and $\phi_i(G)=\phi_j(G)$;
    \item Additivity: $\phi_i(G^1+G^2)=\phi_i(G^1)+\phi_i(G^2)$ for all $i\in \mathcal{N}$.
\end{itemize}

Next, we consider how to define the coalitional stability.
\begin{definition}
    The core $\mathcal{C}(G)$ of a characteristic function game $G=(\mathcal{N},v)$ is the set of all outcomes $(CS,\mathbf{x})$ such that $x(C)\geq v(C)$ for every $C\subseteq \mathcal{N}$.
\end{definition}

Now, if $x(C) < v(C)$ for some $C\subseteq \mathcal{N}$, the agents in $C$ could do better by abandoning the coalition structure $CS$ and forming a coalition of their own. 
In other words, the core of $G$ is the set of stable outcomes in which no subset of players has an incentive to deviate.

\begin{proposition}
If an outcome $(CS,\mathbf{x})$ is in the core of a characteristic function game $G=(\mathcal{N},v)$, then $v(CS)\geq v(CS^\prime)$ for every coalition structure $CS^\prime \in CS_\mathcal{N}$, where $v(CS)=\sum_{C\in CS}v(C)$ is the total value of a coalition structure and $CS_\mathcal{N}$ denotes the space of all coalition structures over $\mathcal{N}$.
\end{proposition}

A classic result by Shapley shows that convex games always have a non-empty core.
\begin{theorem}
If $G=(\mathcal{N},v)$ is a convex game, then $G$ has a non-empty core.
\end{theorem}

Moreover, the Shapley value must be in the core of a convex game with the grand coalition~\citep{shapley1971}.

\begin{figure}
    \centering
    \includegraphics[width=0.7\textwidth]{figures/ECG.png}
    \caption{Taken from~\citep{jianhong2019shapleyQ}.}
    \label{fig:ECG}
\end{figure}

\subsection{Advanced Knowledge}
\subsubsection{Extended Convex Game}

An extended convex game (ECG) extends the convex game to scenarios with infinite horizons and decisions.
\begin{theorem}
\label{thm:ECG}
With the efficient payoff distribution scheme, for an extended convex game (ECG), one solution in the core must exist with the grand coalition and the objective is $\max_\pi v^\pi(\{\mc{N}\})$, which can lead to the maximal social welfare, i.e., $\max_\pi v^\pi(\{\mc{N}\}) \geq \max_\pi v^\pi(\{CS^\prime\})$ for every coalition structure $CS^\prime \in CS_\mc{N}$.
\end{theorem}

\begin{corollary}
\label{cor:ECG}
For an extended convex game (ECG) with the grand coalition, Shapley value must be in the core.
\end{corollary}


As seen from Theorem~\ref{thm:ECG}, with an appropriate efficient payoff distribution scheme, an ECG with the grand coalition is equivalent to a global reward game. 
Both aim to maximize the global value (i.e., the global reward). 
Here, we assume that \textbf{the agents in a global reward game are regarded as the grand coalition.}

\subsection{Approximations of the Shapley Value}
However, computing the Shapley value exactly requires an exponential number of value function evaluations, resulting in exponential time complexity. Therefore, how to approximate the Shapley value is another important question.

\subsubsection{Monte Carlo Permutation Sampling}
\begin{algorithm}
\caption{Monte Carlo Permutation Sampling}\label{alg:mc_shapley}
\textbf{Input:} Cooperative TU game $(\mc{N},v)$; Number of sampled permutations $k$.

\textbf{Output:} Approximated Shapley value $\hat{\phi}_i, \forall i\in \mc{N}$.

$\hat{\phi}_i\longleftarrow 0, \forall i\in \mc{N}$ \\

\For{$t=1,\cdots,k$}{
$\pi \longleftarrow$ Uniform Sample $(\Pi(\mc{N}))$\\
\For{$i\in \mc{N}$}{
    $S_{\pi}(i)\longleftarrow\{j\in\mathcal{N}| \pi(j)< \pi(i)\}$.\\
    $\hat{\phi}_i\longleftarrow \hat{\phi}_i + \frac{v(S_{\pi}(i)\cup \{i\})-v(S_{\pi}(i))}{k}.$
}
}
\end{algorithm}
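
A rough guide to choosing $k$ can be sketched under the assumption that every marginal contribution lies in a known interval $[a,b]$ (an assumption made here for illustration; the algorithm itself does not require it). Since the permutations are sampled independently and uniformly, each $\hat{\phi}_i$ is an average of $k$ i.i.d. samples with mean $\phi_i$, so it is unbiased, and Hoeffding's inequality gives
$$
\Pr\bigl(|\hat{\phi}_i-\phi_i|\geq \epsilon\bigr)\leq 2\exp\Bigl(-\frac{2k\epsilon^2}{(b-a)^2}\Bigr).
$$
For example, with $b-a=1$, $\epsilon=0.1$, and a failure probability of at most $0.05$, it suffices to take $k\geq \ln(2/0.05)/(2\cdot 0.1^2)\approx 184.4$, i.e., $k=185$ permutations, regardless of $n$. Each permutation needs at most $n+1$ distinct evaluations of $v$, because the coalition values along the arrival order can be reused, compared with the $2^n$ coalition values needed for exact computation.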

\subsubsection{Weighted Least Squares}
Shapley values for TU games can be approximated by solving a weighted least squares optimization problem. 
Let $w_C=\frac{|\mc{N}|-1}{\binom{|\mc{N}|}{|C|}|C|(|\mc{N}|-|C| )}$ for every coalition $C$ with $0<|C|<|\mc{N}|$; we then solve
\begin{equation}
    \begin{aligned}
    &\min_{\hat{\phi}_0,\cdots,\hat{\phi}_n} &  \sum_{C\subseteq\mc{N},\, 0<|C|<|\mc{N}|}w_C\Bigl(\hat{\phi}_0+\sum_{i\in C}\hat{\phi}_i - v(C)\Bigr)^2 \\
    &\quad s.t. &\hat{\phi}_0=v(\emptyset),\ \hat{\phi}_0+\sum_{i\in \mc{N} } \hat{\phi}_i = v(\mc{N}).
    \end{aligned}
\end{equation}
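
To see how the weights behave, the following evaluates $w_C$ for $|\mathcal{N}|=4$ (a worked instance of the formula above, not an additional result):
$$
|C|=1:\ w_C=\frac{3}{\binom{4}{1}\cdot 1\cdot 3}=\frac14,\qquad
|C|=2:\ w_C=\frac{3}{\binom{4}{2}\cdot 2\cdot 2}=\frac18,\qquad
|C|=3:\ w_C=\frac{3}{\binom{4}{3}\cdot 3\cdot 1}=\frac14.
$$
Coalitions that are very small or very large thus receive the largest weights. For $|C|=0$ and $|C|=|\mathcal{N}|$ the denominator vanishes, so these two coalitions cannot enter the weighted sum and are instead handled by the two equality constraints.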


\subsubsection{PSRO-series Algorithm}

Let $\mc{N}=\{1,2,\cdots,n\}$ be a set of agents.
Denote by $S=\prod_k S^k$ the space of pure strategy profiles, where $S^k$ is the strategy set of agent $k$,
and let $\mb{M}(s)=(\mb{M}^1(s),\dots,\mb{M}^n(s))\in \mathbb{R}^n$ denote the vector of expected player payoffs for each $s\in S$.

\begin{figure}[hb]
    \centering
    \includegraphics[width=\textwidth]{figures/PSRO.png}
    \caption{Overview of PSRO algorithm phases. Taken from~\citep{paul2019generalized}.}
    \label{fig:psro}
\end{figure}

Figure~\ref{fig:psro} summarizes the PSRO-series algorithm as three iterated phases: complete (Figure~\ref{fig:psro}(a)), solve (Figure~\ref{fig:psro}(b)), and expand (Figure~\ref{fig:psro}(c)).
In the complete phase, a meta-game consisting of all match-ups of the current joint policies is synthesized, with missing payoff entries in $\mb{M}$ completed through game simulations.
Next, in the solve phase, a meta-solver $\mathcal{M}$ computes a profile $\pi$ over the player policies (e.g., Nash, $\alpha$-Rank, or uniform distributions).
Finally, in the expand phase, an oracle $\mathcal{O}$ computes at least one new policy $s^\prime_i$ for each player $i\in \mc{N}$, given profile $\pi$. 


Specifically, for two-player zero-sum games, the goal of agent $v$ is to maximize its payoff $\phi(v,w)$ against an opponent $w$. We refer to $\phi >0$, $\phi < 0$, and $\phi=0$ as wins, losses, and ties for $v$, respectively. In addition, $\phi$ is an antisymmetric function, i.e., $\phi(v,w)=-\phi(w,v)$.
 
\begin{algorithm}
\caption{$\text{PSRO}_\text{N}$ algorithm}\label{alg:psro}
\For{$t=1,2,\cdots$}{
\textbf{Complete}: run simulations and compute the payoff matrix $\mathbf{M}\in \mathbb{R}^{t\times t}$. \\
\textbf{Solve}: calculate the meta-strategy $\pi$ via the Nash meta-solver $\mc{M}$.\\
\textbf{Expand}: an oracle computes a new policy 
$\mathbf{v}_{t+1}\longleftarrow \text{oracle}(\mathbf{v}_t,\sum_{s_i\in S}\pi[i]\cdot \phi_{s_i}(\cdot))$,
then expand the policy space:
${S}\longleftarrow {S}\cup \{\mb{v}_{t+1}\}$
}
\end{algorithm}
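
As an illustration of the solve phase, consider rock-paper-scissors as a toy population (a standard textbook example, used here only as a sketch). With the three pure strategies ordered as rock, paper, scissors, the antisymmetric payoff matrix is
$$
\mathbf{M}=\begin{pmatrix}0&-1&1\\1&0&-1\\-1&1&0\end{pmatrix},
$$
where entry $(v,w)$ is $\phi(v,w)$. The Nash meta-strategy is the uniform distribution $\pi=(1/3,1/3,1/3)$: every row of $\mathbf{M}$ sums to zero, so each pure strategy earns an expected payoff of $0$ against $\pi$ and no deviation is profitable. In the expand phase, the oracle would then train a new policy against opponents sampled from this $\pi$.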























\section{Introduction}
Zero-shot coordination (ZSC) is a major challenge in cooperative AI: training agents that can coordinate with a wide range of unseen partners~\citep{Legg2007Universal,Hu2020OtherPlayFZ}.
The traditional method of self-play (SP)~\citep{SP} iteratively improves a strategy by having it play with copies of itself. 
While SP can converge to an equilibrium of the game~\citep{Fudenberg1998Theory}, the resulting strategies often form specific behaviors and conventions to achieve higher payoffs~\citep{Hu2020OtherPlayFZ}. 
As a result, a fully converged SP strategy may fail to adapt when coordinating with unseen strategies~\citep{Adam2018Learning,Hu2020OtherPlayFZ}.

To overcome the limitations of SP, most ZSC methods promote strategic or behavioral diversity by introducing population-based training (PBT) to improve strategies' adaptive ability~\citep{HARL, Canaan2022Hanabi, MEP, TrajDi}. 
PBT aims to improve cooperative outcomes with the other strategies in the population, and thereby to promote zero-shot coordination with unseen strategies. 
This is achieved by maintaining a set of strategies to break the conventions of SP~\citep{SP} and optimizing the rewards for each pair in the population. 
Most state-of-the-art (SOTA) methods either pre-train a diverse population~\citep{FCP, TrajDi} or introduce hand-crafted methods~\citep{Canaan2022Hanabi, MEP}, and master cooperative games by optimizing fixed objectives within the population. 
These methods have been shown to be effective on intricate cooperative tasks such as Overcooked~\citep{HARL} and Hanabi~\citep{hanabi}.

However, when optimizing a fixed population-level objective, such as the expected reward within the population~\citep{FCP,TrajDi,MEP}, the coordination ability of strategies within the population may not improve. 
Specifically, overall performance may improve while the coordination ability within the population does not improve with it. 
This phenomenon, which we term ``\textit{cooperative incompatibility}'', highlights the importance of considering the trade-off between overall performance and coordination ability when optimizing a fixed population-level objective.
 
To address the problem of cooperative incompatibility, we reformulate cooperative tasks as Graphic-Form Games (GFGs). 
In GFGs, strategies are represented as nodes, and the weight of an edge between two nodes represents the mean cooperative payoff of the two associated strategies.
Additionally, by using sub-graphs of GFGs referred to as preference Graphic-Form Games (P-GFGs), we can profile each node's upper-bound cooperative payoff within the graph, which enables us to evaluate cooperative incompatibility and identify strategies that fail to collaborate.
Furthermore, we propose the Cooperative Open-ended LEarning (\textbf{COLE}) framework, which iteratively generates a new strategy that approximates the best response to the empirical gamescapes of P-GFGs. 
We prove that the {COLE} framework\xspace converges to the optimal strategy at a Q-sublinear rate when in-degree centrality is used as the preference evaluation metric. 

To instantiate the {COLE} framework\xspace and address cooperative incompatibility, we implement a practical algorithm, \algo, by combining the \textbf{S}hapley \textbf{V}alue solution~\citep{shapley1971} with our GFG. 
\algo comprises a simulator, a solver, and a trainer, and is specifically designed to master cooperative tasks with two players.
The solver, building on the intuitive solution concept of the Shapley value, evaluates the adaptive ability of strategies and calculates the cooperative incompatibility distribution.
The trainer aims to approximate the best responses to the mixture of the most recent population weighted by the cooperative incompatibility distribution. 
To evaluate the performance of \algo, we conducted experiments in Overcooked, a cooperative task environment~\citep{HARL}.

 We evaluated the adaptive ability of \algo by testing its performance with partners of different skill levels. 
 The middle-level partner is a commonly used behavior cloning model~\citep{HARL}, and the expert partners are strategies trained by current methods, i.e., SP, PBT, FCP, and MEP.

 The experimental results show that \algo outperforms the recent SOTA methods under both evaluation protocols. 
 Additionally, an analysis of GFGs and P-GFGs over the learning process of \algo shows that the framework efficiently overcomes cooperative incompatibility. 
The contributions of this paper can be summarized as follows.
\begin{itemize}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{1pt}
\setlength{\parsep}{1pt}
\vspace{-3.5mm}
    \item We introduce the concept of Graphic-Form Games (GFGs) and Preference Graphic-Form Games (P-GFGs) to intuitively reformulate cooperative tasks, which allows for a more efficient evaluation and identification of cooperative incompatibility during learning.
    \item We develop the concept of graphic-form gamescapes to help understand the objective and present the {COLE} framework\xspace to iteratively approximate the best responses preferred by most others.
    \item We prove that the algorithm will converge to the optimal strategy, and the convergence rate will be Q-sublinear when using in-degree preference centrality. Empirical experiments in the game Overcooked verify the proposed algorithm's effectiveness compared to SOTA methods.
\end{itemize}

\section{Related Works}

\textbf{Zero-shot coordination.} The goal of zero-shot coordination (ZSC) is to train a strategy that can coordinate effectively with unseen partners~\citep{Hu2020OtherPlayFZ}. 
Self-play~\citep{SP,HARL} is a traditional method of training a cooperative strategy, which iteratively improves a strategy by playing with copies of itself, but it develops conventions between players and fails to cooperate with unseen strategies~\citep{Adam2018Learning,Hu2020OtherPlayFZ}. 
Other-play~\citep{Hu2020OtherPlayFZ} was proposed to break such conventions by applying permutations to one of the strategies.
However, this approach may reduce to self-play if the game or environment has no symmetries or its symmetries are unknown. 
Another approach is population-based training (PBT)~\citep{PBT,HARL}, which trains strategies by having them interact with each other in a population.
However, PBT does not explicitly maintain diversity and thus fails to coordinate with unseen partners~\citep{FCP}. 

To achieve the goal of ZSC, recent research has focused on training robust strategies that use diverse populations of strategies~\citep{FCP,TrajDi,MEP}. 
Fictitious co-play (FCP)~\citep{FCP} obtains a population of periodically saved checkpoints during self-play training with different seeds and then trains the best response to the pre-trained population. 
TrajeDi~\citep{TrajDi} also maintains a pre-trained self-play population but encourages distinct behavior among the strategies. 
The maximum entropy population (MEP)~\citep{MEP} method proposes population entropy rewards to enhance diversity during pre-training. It employs prioritized sampling to select challenging-to-collaborate partners to improve generalization to previously unseen policies. 
Furthermore, methods such as MAZE~\citep{Xue2022Heter} and CG-MAS~\citep{Mahajan2022Gen} have been proposed to improve generalization ability through coevolution and combinatorial generalization.
In this paper, we propose the {COLE} framework\xspace, which dynamically identifies strategies that fail to coordinate due to cooperative incompatibility and continually poses and optimizes objectives to overcome this challenge and improve adaptive capabilities.

\textbf{Open-ended learning.}
Another related area of research is open-ended learning, which aims to continually discover and approach objectives~\citep{Srivastava2012Comtinually, Team2021OpenEndedLL,Meier2022Open}.
In MARL, most open-ended learning methods focus on zero-sum games, primarily posing adaptive objectives to expand the frontiers of strategies~\citep{psro,psrorn,pipelinepsro,yaodong_diverse,YingWen2021openenned,mcaleer2022self}.
In the specific context of ZSC, the MAZE method~\citep{Xue2022Heter} utilizes open-ended learning by maintaining two populations of strategies and partners and training them collaboratively throughout multiple generations. 
In each generation, MAZE pairs strategies and partners from the two populations and updates them together by optimizing a weighted sum of rewards and diversity. 
This method co-evolves the two populations of strategies and partners based on naive evaluations, such as the best or worst performance with strategies in the partner population.
Our proposed method, the {COLE} framework\xspace, combines GFGs and P-GFGs in open-ended learning to evaluate the cooperative ability of strategies and identify those that fail to cooperate, solving cooperative incompatibility efficiently with a theoretical guarantee.

\begin{figure*}[ht!]
\includegraphics[width=0.95\linewidth]{figures/gfg-yang.pdf}
\centering
\caption{ {The Game Graph, (sub-) preference graph and corresponding preference centrality matrix.}
The (sub-)preference graphs cover all four iterations of the training process, and the corresponding preference in-degree centrality matrix is computed from them.
 As can be observed in ${\mathcal{G}}^\prime_3$ and ${\mathcal{G}}^\prime_4$, the newly updated strategies fail to be preferred by others and have centrality values of 1, despite an increase in their mean reward with all others. 
 In \textit{(b)}, we illustrate an ideal learning process in which each newly generated strategy achieves higher outcomes with all previous strategies.
}
\vspace{-5mm}
\label{fig:game_graph}
\end{figure*}


\section{Preliminaries}
\textbf{Normal-form Game:}
A two-player normal-form game is defined as a tuple $(N, {\mathcal{A}}, {\mathbf{w}})$, where $N=\{1,2\}$ is the set of two players, indexed by $i$, ${\mathcal{A}}={\mathcal{A}}_1 \times {\mathcal{A}}_2$ is the joint action space, and ${\mathbf{w}}=(w_1, w_2)$, with $w_i: {\mathcal{A}} \rightarrow \mathbb{R}$ the reward function of player $i$. 
In a two-player common-payoff game, the two players' rewards are the same, i.e., $w_1(a_1,a_2)=w_2(a_1,a_2)$ for all $(a_1,a_2) \in {\mathcal{A}}$.

\textbf{Empirical Game-theoretic Analysis (EGTA), Empirical Game and Empirical Gamescape.} EGTA is the study of finding meta-strategies based on experience with prior strategies~\citep{Walsh2002Analyzing, Karl2018Generalised}. 
An empirical game is built by discovering strategies and meta-reasoning about exploring the strategy space~\citep{NIPS2017_3323fe11}.
Furthermore, empirical gamescapes (EGS) were introduced to represent strategies in functional-form games geometrically~\citep{psrorn}.
Given a population ${\mathcal{N}}$ of $n$ strategies, the empirical gamescape is often defined as 
$
    {\mathcal{G}} := \{\text{convex mixtures of rows of}~{\mathcal{M}} \},
$
where ${\mathcal{M}}$ is the empirical payoff table recording the expected outcomes for each joint strategy.


\textbf{Shapley Value.}
The Shapley Value~\citep{shapley1971} is one of the most important solution concepts for coalition games~\citep{chalkiadakis2011computational,Bezalel2007intro}.
The Shapley Value aims to fairly distribute the collective value of a team, such as its rewards or costs, across individuals according to each player's contribution.
Given a coalition game $({\mathcal{N}},v)$ with a strategy set ${\mathcal{N}}$ of size $n$ and characteristic function $v$, the Shapley Value of a player $i\in {\mathcal{N}}$ is given by 

\begin{equation}
\vspace{-1.5mm}
SV(i)=\frac{1}{n!}\sum_{\pi\in\Pi_\mathcal{N}} \bigl[v(P_i^\pi \cup \{i\}) - v(P_i^\pi)\bigr],
\label{eq:SV}
\end{equation}
where $\pi$ is a permutation, i.e., a one-to-one mapping from ${\mathcal{N}}$ to itself, in the permutation set $\Pi_\mathcal{N}$, and $\pi(i)$ is the position of player $i \in {\mathcal{N}}$ in permutation $\pi$. $P_i^\pi=\{j\in {\mathcal{N}} | \pi(j)<\pi(i)\}$ is the set of all predecessors of $i$ in $\pi$.



 

\section{Cooperative Open-Ended Learning}
In this section, we first introduce graphic-form games to intuitively reformulate cooperative games, then create an open-ended learning framework to solve cooperative incompatibility and further improve zero-shot adaptive ability.

\subsection{Graphic-Form Games (GFGs)} 
To overcome cooperative incompatibility, it is important to evaluate it and to identify the strategies that fail to collaborate.
Therefore, we propose graphic-form games (GFGs) to reformulate normal-form cooperative games from the perspectives of game theory and graph theory, as a natural development of empirical games~\citep{psrorn}.
The definition of a GFG is given below.

\begin{definition}[Graphic-Form Game]
    Given a set of parameterized strategies ${\mathcal{N}}=\{1,2,\cdots,n\}$, a two-player graphic-form game (GFG) is a tuple ${\mathcal{G}} = ({\mathcal{N}}, {\mathbf{E}}, {\mathbf{w}})$, which could be represented as a directed weighted graph.
    ${\mathcal{N}},{\mathbf{E}},{\mathbf{w}}$ are the set of nodes, edges, and weights, respectively.
    Given an edge $(i,j)$, ${\mathbf{w}}(i,j)$ represents the expected payoff of $i$ playing with $j$.
     The graphic representation of GFG is called a game graph.
     \vspace{-2mm}
\end{definition}
The payoff matrix of ${\mathcal{G}}$ is denoted as ${\mathcal{M}}$, where ${\mathcal{M}}(i,j)={\mathbf{w}}(i,j), \forall i,j \in {\mathcal{N}}$.
Our goal is to improve the upper bound of other strategies' outcomes when cooperating within the population, which implies that the strategy should be preferred over other strategies as a partner.

Moreover, we propose preference graphic-form games (P-GFGs) as an efficient tool to analyze the current learning state; a P-GFG profiles the degree of preference for each node in a GFG.
Specifically, a P-GFG is a subgraph of a GFG in which each node retains only the out-edge with the maximum weight among all its out-edges, excluding its self-loop.
Given a GFG $({\mathcal{N}}, {\mathbf{E}}, {\mathbf{w}})$, the P-GFG is defined as ${\mathcal{G}}^\prime = \{{\mathcal{N}},{\mathbf{E}}^\prime, {\mathbf{w}}\}$, where ${\mathbf{E}}^\prime=\{(i,j) \mid j=\argmax_{j^\prime\in {\mathcal{N}}\backslash \{i\}} {\mathbf{w}}(i, j^\prime),\ \forall i \in {\mathcal{N}}\}$ is the set of edges. The graphic representation of a P-GFG is called a preference graph.

To investigate the learning process in more depth, we further introduce \textit{sub-preference graphs} based on P-GFGs, which reconstruct previous learning states and allow us to analyze the learning behavior of the algorithm.
Suppose that there is a set of sequentially generated strategies ${\mathcal{N}}_n=\{1,2,\cdots,n\}$, where, for simplicity, the index of a strategy is also the iteration in which it was generated.
For each previous iteration $i<n$, the sub-preference graph is denoted as $\{{\mathcal{N}}_i, {\mathbf{E}}^\prime_i,\mb{w}_i\}$, where ${\mathcal{N}}_i=\{1,2,\cdots,i\}$ is the set of strategies in iteration $i$, and ${\mathbf{E}}^\prime_i$ and $\mb{w}_i$ are the corresponding edges and weights.

The semantics of the preference graph is that a strategy (node) $i$ prefers to play with the node its out-edge points to, because that partner yields its highest payoff.
In other words, the more in-edges a node has, the stronger its cooperative ability.
Ideally, if one strategy adapts well to all others, all the other strategies in the preference graph point to it.
To evaluate the adaptive ability of each node, we introduce the concept of centrality into the preference graph to measure how strongly a node is preferred.
\begin{definition}[Preference Centrality]
    Given a P-GFG $\{{\mathcal{N}},{\mathbf{E}}^\prime, {\mathbf{w}}\}$, the preference centrality of $i\in {\mathcal{N}}$ is defined as
    $$
    \eta(i)=1- \operatorname{norm}(d_i),
    $$
    where $d_i$ is a graph centrality metric that evaluates how much node $i$ is preferred, and $\operatorname{norm}: \mathbb{R}\rightarrow [0,1]$ is a normalization function.
    \vspace{-2mm}
\end{definition}

Here, $d$ can be any centrality that measures how much a node is preferred.
A typical example is in-degree centrality, which counts the edges that point to the node.
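
As a small worked example, the sketch below computes $\eta$ with in-degree centrality on a hypothetical three-node preference graph. Dividing the in-degree by $n-1$ (the largest in-degree a node can have when every other node contributes one out-edge) is an assumed choice of $\operatorname{norm}$; other normalizations are possible.

```python
def preference_centrality(edges, n):
    """eta(i) = 1 - norm(d_i), with d_i the in-degree of node i.

    norm divides by n - 1, the maximum in-degree in a preference graph
    where each node keeps exactly one out-edge (assumed normalization)."""
    indegree = [0] * n
    for _, target in edges:
        indegree[target] += 1
    return [1.0 - d / (n - 1) for d in indegree]


# Hypothetical preference graph: nodes 0 and 1 prefer node 2, node 2 prefers node 0.
pref_edges = [(0, 2), (1, 2), (2, 0)]
eta = preference_centrality(pref_edges, n=3)
print(eta)  # [0.5, 1.0, 0.0]
```

Node 2 is preferred by all other nodes and therefore reaches $\eta=0$, whereas node 1 is preferred by no one and has $\eta=1$.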

Fig.~\ref{fig:game_graph} is an example of a common payoff game, showing the game graph, (sub-)preference graphs, and the preference centrality matrix for four sequentially generated strategies.
Note that in the corresponding sub-preference graphs, the updated strategies fail to improve the outcome of others after the second iteration,
and the preference centrality matrix also shows the same results.
The example exhibits cooperative incompatibility: the value of $\eta$ stays at 1 in the matrix, meaning that no node prefers to collaborate with the updated strategies.
Ideally, all the other strategies should prefer the latest strategy (Fig.~\ref{fig:game_graph} (b)), which would indicate a monotonic improvement of cooperative ability.
\begin{figure}[t!]
\centering
\includegraphics[width=0.7\linewidth]{figures/MEP_eta.pdf}
\caption{
The payoff matrix of each strategy during training and the corresponding preference centrality matrix of the MEP algorithm in Overcooked. The darker the color in the payoff matrix, the higher the rewards. The darker the color in the preference centrality matrix, the lower the centrality value, and the more other strategies prefer it.
}
\label{fig:mep_eta}
\vspace{-4mm}
\end{figure}


Moreover, the analysis of the MEP algorithm, as shown in Fig.~\ref{fig:mep_eta}, reveals cooperative incompatibility in the learning process in the Overcooked environment~\citep{HARL}.
In the preference indegree centrality matrix, a strategy is preferred by more strategies if its color is darker.
In the learning process of MEP, although the mean rewards are always improving (as shown in the upper-right of Fig.~\ref{fig:mep_eta}), serious cooperative incompatibility problems occur after a period of training, where more strategies prefer to play with some previous strategies with a darker color rather than new strategies to obtain higher rewards.
\begin{figure*}[ht!]
    \centering
\includegraphics[width=0.85\linewidth]{figures/method/framework-crop.pdf}
\vspace{-0.3cm}
    \caption{
    An overview of one generation in {COLE} framework\xspace: The solver derives the cooperative incompatible distribution $\phi$ using a cooperative incompatibility solver, which can be any algorithm that evaluates cooperative contribution. The trainer then approximates the relaxed best response by optimizing individual and cooperative compatible objectives. The oracle's training data is generated using partners selected based on the cooperative incompatibility distribution and the agent's strategy. Finally, the approximated strategy $s_{n+1}$ is added to the population, and the next generation begins.
    }
    \label{fig:cole}
    \vspace{-3mm}
\end{figure*}
\vspace{-1mm}
\subsection{Cooperative Open-Ended Learning Framework}
\vspace{-1mm}
To tackle cooperative incompatibility by understanding the objective, we develop empirical gamescapes~\citep{psrorn} for GFGs, which geometrically represent strategies in graphic-form games.
Given a GFG $\{{\mathcal{N}}, {\mathbf{E}}, {\mathbf{w}}\}$, the empirical gamescape (EGS) is defined as 
\begin{equation}
\vspace{-1pt}
    \Bar{{\mathcal{G}}} := \left\{\text{convex mixture of rows of } {\mathcal{M}} \right\}.
\vspace{-2mm}
\end{equation}

However, learning directly with EGS to cooperate with these well-collaborated strategies is inefficient in improving adaptive ability.
To conquer cooperative incompatibility, a natural idea is to learn against a mixture given by the cooperative incompatible distribution over the most recent population ${\mathcal{N}}$.
Given a population ${\mathcal{N}}$, we present a \textit{cooperative incompatible solver} to assess how strategies collaborate, especially with those strategies that are difficult to collaborate with.
The solver derives the cooperative incompatible distribution $\phi$, where strategies that do not coordinate with others have higher probabilities.

In addition to the cooperative incompatible mixture, we optimize an individual objective, the cumulative self-play reward, to improve the adaptive ability with expert partners.
To simplify, we name it the individual and cooperative incompatible mixture (IPI mixture).
We use an approximate oracle to approach the best response over the IPI mixture.
Given strategy $s_n$, the oracle returns a new strategy $s_{n+1}$:
$
    s_{n+1} = \operatorname{oracle}(s_{n}, {\mathcal{J}}(s_{n}, \phi)),
$
with $\eta(s_{n+1})=0$, if possible.
${\mathcal{J}}$ is the objective function as follows,
\begin{equation}
\vspace{-1pt}
    \label{eq:obj}
    {\mathcal{J}}(s_n,\phi) = \mathbb{E}_{p\sim \phi}{\mathbf{w}}(s_n,p) + \alpha {\mathbf{w}}(s_n,s_n),
\vspace{-1pt}
\end{equation}
where $\alpha$ is the balance hyperparameter.
The objective consists of the cooperative compatible objective and the individual objective.
The cooperative compatible objective aims to train the best response to those failed-to-collaborate strategies, and the individual objective aims to improve the adaptive ability with expert partners.
We call the best response the best-preferred strategy if $\eta(s_{n+1})=0$. 
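
For concreteness, the objective in Eq.~\ref{eq:obj} can be evaluated exactly when the payoff matrix and $\phi$ are known. The sketch below is illustrative only: the function name \texttt{joint\_objective}, the payoff values, the distribution $\phi$, and the value of $\alpha$ are hypothetical.

```python
def joint_objective(payoff, ego, phi, alpha):
    """J(s_n, phi) = E_{p ~ phi}[w(s_n, p)] + alpha * w(s_n, s_n)."""
    # Cooperative compatible term: expected payoff with partners drawn from phi.
    cooperative = sum(prob * payoff[ego][p] for p, prob in enumerate(phi))
    # Individual term: self-play payoff of the ego strategy.
    individual = payoff[ego][ego]
    return cooperative + alpha * individual


# Hypothetical payoff matrix and cooperative incompatible distribution.
payoff = [[5.0, 2.0, 4.0],
          [2.0, 6.0, 3.0],
          [4.0, 3.0, 1.0]]
phi = [0.2, 0.3, 0.5]
value = joint_objective(payoff, ego=2, phi=phi, alpha=0.5)
print(round(value, 6))
```

Here the cooperative term is $0.2\cdot 4 + 0.3\cdot 3 + 0.5\cdot 1 = 2.2$ and the individual term contributes $0.5 \cdot 1 = 0.5$.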

However, arriving at the best-preferred strategy with $\eta(s_{n+1})=0$ is hard or even impossible.
Therefore, we seek to approximate the best-preferred strategies by relaxing the best strategy to the strategy whose preference centrality ranks top $k$.
The approximate oracle can be rewritten as $
    s_{n+1} = \operatorname{oracle}(s_n, {\mathcal{J}}(s_n, \phi)),
$
with $\mathcal{R}(\eta(s_{n+1}))>k$, where $\mathcal{R}$ is the ranking function.

We extend the approximated oracle to open-ended learning and propose the {COLE} framework\xspace (Fig.~\ref{fig:cole}).
The {COLE} framework\xspace iteratively generates new strategies that approximate the best-preferred strategies with respect to the cooperative incompatible mixture and the individual objective.
The simulator completes the payoff matrix with the newly generated strategy and the others in the population.
The solver derives the cooperative incompatible distribution using the game graph builder and the cooperative incompatible solver.
The trainer uses the oracle to approximate the best-preferred strategy with respect to the cooperative incompatible mixture and the individual objective, and outputs a newly generated strategy, which is added to the population for the next generation.

Although we relax the best-preferred strategy to the strategy in the top $k$ centrality in the constraint, {COLE} framework\xspace still converges to a local best-preferred strategy with zero preference centrality.
Formally, the local best-preferred strategy convergence theorem is given as follows.
 \begin{restatable}{theorem}{FirstTHM}
    \label{thm: converge}
Let $s_0\in {\mathcal{S}}$ be the initial strategy and $s_i=\operatorname{oracle}(s_{i-1})$ for $i \in \mathbb{N}$.
    Under the assumption that $\lim_{n\rightarrow \infty}{\mathcal{J}}(s_n,\phi)\geq {\mathcal{J}}(s_{n-1},\phi)$ holds, we can say that the sequence $\{s_i\}$ for ${i\in \mathbb{N}}$ converges to a local optimal strategy $s^*$, i.e., the local best-preferred strategy.
 \end{restatable}
 \begin{proof} 
 \vspace{-3mm}
 See Appendix~\ref{appendix:proofs_thm}.
 \vspace{-3mm}
\end{proof}
\vspace{-1mm}
Moreover, if we choose in-degree centrality as the preference centrality function, the convergence rate of {COLE} framework\xspace is Q-sublinear.
\begin{restatable}{corollary}{FirstLEMMA}
    \label{lemma: converge_rate}
Let $\eta: {\mathcal{G}}^\prime \rightarrow \mathbb{R}^n$ be a function that maps a P-GFG to its in-degree centrality. Then the convergence rate of the sequence $\{s_i\}$ is Q-sublinear with respect to $\eta$.
 \end{restatable}
\begin{proof}
 \vspace{-3mm}
See Appendix~\ref{appendix:proofs_corollary}.
\vspace{-3mm}
\end{proof}

\vspace{-1mm}
\section{Practical Algorithm}
\label{sec:COOL}
To address common-payoff games with two players, we implemented \algo, where SV refers to \emph{Shapley Value}, based on {COLE} framework\xspace that can overcome cooperative incompatibility and improve zero-shot coordination capabilities, focusing on the solver and trainer components.
As shown in Fig.~\ref{fig:cole}, at each generation, \algo inputs a population ${\mathcal{N}}$ and generates an approximate best-preferred strategy added to ${\mathcal{N}}$ to expand the population.
The simulator calculates the payoff matrix ${\mathcal{M}}$ for the input population ${\mathcal{N}}$. 
Each element ${\mathcal{M}}(i,j)$ for $i,j\in {\mathcal{N}}$ represents the cumulative rewards of the players $i$ and $j$ at both starting positions. 
The solver evaluates and identifies failed-to-collaborate strategies by calculating the cooperative incompatible distribution. 
To effectively evaluate the cooperative ability of each strategy with all others, we incorporate weighted PageRank (WPG)~\citep{Xing2004Weighted} from graph theory into the Shapley Value to evaluate adaptability, particularly with failed-to-collaborate strategies. 
The trainer then approximates the best-preferred strategy over the recent population.



\vspace{-3mm}
\subsection{Solver: Graphic Shapley Value}
To approximate the best-preferred strategies over the recent population and overcome cooperative incompatibility, we need to calculate the cooperative incompatible distribution as the mixture.
In this paper, we combine the Shapley Value~\citep{shapley1971} solution, an efficient single solution concept for cooperative games to assign the obtained team value across individuals, with our GFG to evaluate and identify the strategies that did not cooperate.
To apply the Shapley Value, we define an additional characteristic function to evaluate the value of the coalition.
Formally, given a coalition $C\subseteq {\mathcal{N}}$, we have the following:
  $  v(C) = \mathbb{E}_{i\sim C,j\sim C}\sigma(i)\sigma(j){\mathbf{w}}(i,j),$
where $\sigma$ is a mapping function that evaluates how badly a node performs on its game graph.
We use the characteristic function to evaluate the coalition value of how it could cooperate with those hard-to-collaborate strategies.

We take the inverse of WPG~\citep{Xing2004Weighted} on the game graph as the metric $\sigma$.
WPG is proposed to assess the popularity of a node in a complex network.
The formula of WPG is given as follows:
$    \hat\sigma(u)=(1-d) + d \sum_{v\in B(u)} \hat\sigma(v) \frac{I_u}{\sum_{p\in R(v)} I_p} \frac{O_u}{\sum_{p\in R(v)}O_p},
$
where $d$ is the damping factor set to $0.85$, $B(u)$ is the set of nodes that point to $u$, $R(v)$ denotes the set of nodes to which $v$ links, and $I$ and $O$ are the in-degree and out-degree of a node, respectively.
Therefore, the metric $\sigma$ evaluates how unpopular a node is and is equal to the inverse of the WPG value $\hat\sigma$.
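
A minimal sketch of this computation is given below, assuming a plain fixed-point iteration on an unweighted edge list and reading ``inverse'' as the reciprocal $\sigma = 1/\hat\sigma$. The function name \texttt{weighted\_pagerank}, the number of iterations, the handling of zero denominators, and the example graph are all assumptions for illustration.

```python
def weighted_pagerank(edges, n, d=0.85, iters=100):
    """Iterate rank(u) = (1 - d) + d * sum_{v in B(u)} rank(v) * W_in * W_out."""
    out_nbrs = [[] for _ in range(n)]  # R(v): nodes that v links to
    in_nbrs = [[] for _ in range(n)]   # B(u): nodes that point to u
    for v, u in edges:
        out_nbrs[v].append(u)
        in_nbrs[u].append(v)
    indeg = [len(b) for b in in_nbrs]
    outdeg = [len(r) for r in out_nbrs]

    rank = [1.0] * n
    for _ in range(iters):
        new_rank = []
        for u in range(n):
            total = 0.0
            for v in in_nbrs[u]:
                sum_in = sum(indeg[p] for p in out_nbrs[v])
                sum_out = sum(outdeg[p] for p in out_nbrs[v])
                # A zero denominator is treated as zero weight (assumption).
                w_in = indeg[u] / sum_in if sum_in else 0.0
                w_out = outdeg[u] / sum_out if sum_out else 0.0
                total += rank[v] * w_in * w_out
            new_rank.append((1 - d) + d * total)
        rank = new_rank
    return rank


# Hypothetical digraph: nodes 1 and 2 point to node 0, node 0 points to node 1.
graph_edges = [(1, 0), (2, 0), (0, 1)]
rank = weighted_pagerank(graph_edges, n=3)
sigma = [1.0 / r for r in rank]  # reciprocal as the "inverse" (assumption)
print(rank, sigma)
```

Node 2 has no in-edges, so its rank stays at $1-d$ and it receives the largest $\sigma$, i.e., it is the least popular node in this example.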

Then we calculate the Shapley Value of each node using the characteristic function in equation~\ref{eq:SV}, which we call the graphic Shapley Value.
We use Monte Carlo permutation sampling~\citep{Castro2009PolynomialCO} to approximate the Shapley Value, which reduces the computational complexity from exponential to linear time.
After inverting the probabilities of the graphic Shapley Value, we get the cooperative incompatible distribution $\phi$, where strategies that fail to collaborate with others have higher probabilities.
We provide the Graphic Shapley Value algorithm in Appendix~\ref{appedix:solver}.
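
The sampling step can be sketched as follows. This is an illustrative version of generic Monte Carlo permutation sampling, not the pseudocode from the appendix: the coalition value averages $\sigma(i)\sigma(j){\mathbf{w}}(i,j)$ over uniformly drawn pairs in $C$ (an assumed reading of the expectation), and the names \texttt{coalition\_value} and \texttt{monte\_carlo\_shapley} as well as the values of $\sigma$ and the payoffs are hypothetical. The sketch stops at the Shapley estimates.

```python
import random


def coalition_value(coalition, sigma, payoff):
    """v(C): mean of sigma(i) * sigma(j) * w(i, j) over all pairs (i, j) in C x C."""
    if not coalition:
        return 0.0
    total = sum(sigma[i] * sigma[j] * payoff[i][j]
                for i in coalition for j in coalition)
    return total / (len(coalition) ** 2)


def monte_carlo_shapley(n, value_fn, num_permutations=1000, seed=0):
    """Average each player's marginal contribution over random permutations."""
    rng = random.Random(seed)
    shapley = [0.0] * n
    players = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(players)
        coalition = []
        prev = value_fn(coalition)
        for p in players:
            coalition.append(p)
            cur = value_fn(coalition)
            shapley[p] += cur - prev  # marginal contribution of p
            prev = cur
    return [s / num_permutations for s in shapley]


# Hypothetical payoffs and unpopularity scores for three strategies.
payoff = [[5.0, 2.0, 4.0],
          [2.0, 6.0, 3.0],
          [4.0, 3.0, 1.0]]
sigma = [1.0, 0.5, 2.0]
values = monte_carlo_shapley(
    3, lambda c: coalition_value(c, sigma, payoff), num_permutations=200)
print(values)
```

Each sampled permutation requires $n$ evaluations of the characteristic function, so the cost grows linearly with the number of permutations and players. Because the marginal contributions telescope within each permutation, the estimates always sum to $v({\mathcal{N}})$.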

\subsection{Trainer: Approximating Best-preferred Strategy}

The trainer takes the cooperative incompatible distribution $\phi$ as input and samples its teammates to learn to approach the best-preferred strategy on the IPI mixture.

Recall the oracle for $s_n$:
$
    s_{n+1} = \operatorname{oracle}(s_{n}, {\mathcal{J}}(s_{n}, \phi)),
$
with $\mathcal{R}(\eta(s_{n+1}))>k$.
\algo aims to optimize the best-preferred strategy over the IPI mixture. 
The ${\mathcal{J}}(s_{n}, \phi)$ is the joint objective that consists of individual and cooperative compatible objectives.
The individual objective aims to improve the performance within itself and promote the adaptive ability with expert partners, formulated as follows:
$
    {\mathcal{J}}_i(s_n) = {\mathbf{w}}(s_n,s_n),
$
where $s_n$, called the ego strategy, is the strategy to be optimized in generation $n$.

The cooperative compatible objective aims to improve cooperative outcomes with those failed-to-collaborate strategies:
$
    {\mathcal{J}}_{c} = \mathbb{E}_{p\sim \phi}{\mathbf{w}}(s_n,p),
$
where the objective is the expected rewards of $s_n$ with cooperative incompatible distribution-supported partners.
${\mathbf{w}}$ estimates and records the mean cumulative rewards of multiple trajectories and starting positions.
The expectation can be approximated as:
${\mathcal{J}}_{c} \approx \sum^b_{p\sim \phi}\phi(p){\mathbf{w}}(s_n,p),
$
where $b$ is the number of sampling times.

\label{sec:solver}
\begin{algorithm}[t!]
\caption{\algo Algorithm}
\label{alg:cool}
\begin{algorithmic}[1]
\STATE \algorithmicrequire population ${\mathcal{N}}_{0}$, the sample times $a,b$ of ${\mathcal{J}}_i,{\mathcal{J}}_c$, hyperparameters $\alpha,k$

\FOR{$t = 1,2,\cdots, $}
    \COMMENT{Step 1: Completing the payoff matrix}
    \STATE ${\mathcal{M}}_t \leftarrow \operatorname{Simulator}({\mathcal{N}}_t)$
    \COMMENT{Step 2: Solving the cooperative incompatibility distribution}
    \STATE $\phi = \operatorname{Graphic\ Shapley\ Value}({\mathcal{N}}_t)$ by Algorithm~\ref{algo:solver}
    \COMMENT{Step 3: Approximate the best-preferred strategy}
    \STATE ${\mathcal{J}} = \sum^b_{p\sim \phi}\phi(p){\mathbf{w}}(s_t,p) + \alpha \sum^a{\mathbf{w}}(s_t,s_t)$, where $s_t={\mathcal{N}}_t(t)$, $\phi$ is updated each time by Eq~\ref{eq:SUCG}
    \STATE {$s_{t+1} = \operatorname{oracle}(s_t, {\mathcal{J}})$} with $\mathcal{R}(\eta(s_{t+1}))>k$
    \COMMENT{Step 4: Expand the population}
    \STATE ${\mathcal{N}}_{t+1} = {\mathcal{N}}_{t} \cup \{s_{t+1}\}$ 
    \ENDFOR
\end{algorithmic}
\end{algorithm}
\setlength{\textfloatsep}{3mm}
\vspace{-1mm}

To balance exploitation and exploration as learning continues, we present the Sampled Upper Confidence Bound for Game Graph (SUCG), which combines the Upper Confidence Bound (UCB) with the GFG to steer sampling toward strategies with higher probabilities as well as toward new strategies.
Additionally, we treat the SUCG value as the probability of sampling a teammate instead of selecting the maximum item as in typical UCB algorithms.
Specifically, in the game graph, we record the number of times each node has been visited.
Therefore, the probability of each node considers both the Shapley Value and the number of visits, and is denoted as $\hat\phi$.
The SUCG for any node $u$ in ${\mathcal{N}}$ could be calculated as follows:
\begin{equation}
    \hat\phi(u) = \phi(u) + c\frac{\sqrt{\sum_{i\in {\mathcal{N}}} \mb{N}(i)}}{1+\mb{N}(u)},
    \label{eq:SUCG}
\end{equation}
where $c$ is a hyperparameter that controls the degree of exploration and $\mb{N}(i)$ is the number of times node $i$ has been visited.
SUCG prevents \algo from generating data with only a few fixed failed-to-collaborate strategies, which could lead to a loss of adaptive ability.
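
The effect of Eq.~\ref{eq:SUCG} can be seen in a small numerical sketch. The values of $\phi$, the visit counts, and $c$ are hypothetical, and renormalizing the scores so that they sum to one before sampling is an assumption.

```python
import math


def sucg_scores(phi, visits, c=1.0):
    """phi_hat(u) = phi(u) + c * sqrt(sum_i N(i)) / (1 + N(u))."""
    total_visits = sum(visits)
    return [p + c * math.sqrt(total_visits) / (1 + n_u)
            for p, n_u in zip(phi, visits)]


def to_probabilities(scores):
    """Renormalize scores into a sampling distribution (assumed step)."""
    z = sum(scores)
    return [s / z for s in scores]


# Hypothetical distribution and visit counts for three strategies.
phi = [0.5, 0.3, 0.2]
visits = [8, 1, 0]
scores = sucg_scores(phi, visits, c=0.5)
probs = to_probabilities(scores)
print(scores, probs)
```

With nine visits in total, the exploration bonus is $0.5 \cdot 3 / (1+\mb{N}(u))$, so the never-visited strategy receives the largest score ($0.2 + 1.5 = 1.7$) even though its $\phi$ is the smallest.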

We summarize \algo in Algorithm~\ref{alg:cool}.
Moreover, to verify the influence of different ratios of two objectives, we denote \algo with different ratios as 0:4, 1:3, 2:2, and 3:1.
Specifically, \algo with $a:b$ represents different partner sampling ratios for the combining objective, where $a$ is the corresponding times to generate data using self-play for the individual objective, and $b$ is the number of sampling times in ${\mathcal{J}}_c$.
For example, \algo 1:3 trains by using self-play once, and sampling from the cooperative incompatible distribution as partners three times to generate data and update objectives.
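
The $a:b$ partner schedule can be sketched as follows. The helper name \texttt{sample\_partners}, the fixed seed, and the distribution are hypothetical; the sketch only shows how one batch of partners might be assembled.

```python
import random


def sample_partners(ego, phi, a, b, seed=0):
    """Return a partner list for one batch: `a` self-play slots followed by
    `b` partners drawn from the cooperative incompatible distribution phi."""
    rng = random.Random(seed)
    population = list(range(len(phi)))
    partners = [ego] * a                                     # individual objective
    partners += rng.choices(population, weights=phi, k=b)    # cooperative objective
    return partners


# Ratio 1:3 with a hypothetical distribution over three strategies.
partners = sample_partners(ego=2, phi=[0.2, 0.3, 0.5], a=1, b=3)
print(partners)
```

With ratio 0:4, the first list is empty and all four partners are drawn from $\phi$.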
\vspace{-2mm}
\section{Experiments}
\label{sec:exp}
\vspace{-1mm}
\subsection{Environment and Experimental Setting}
\vspace{-1mm}
In this paper, we conduct a series of experiments in the Overcooked environment~\citep{HARL,charakorn2020investigating,knott2021evaluating}.
The details of the Overcooked environment can be found in Appendix~\ref{appedix:layouts}.
We construct evaluations with different ratios between individual and cooperative compatible objectives, such as 0:4, 1:3, 2:2, and 3:1. 
These studies demonstrate the effectiveness of optimizing both individual and cooperative compatible objectives. 
We also compare our method with other methods, including self-play~\citep{SP,HARL}, PBT~\citep{PBT,HARL}, FCP~\citep{FCP}, and MEP~\citep{MEP}, all of which use PPO~\citep{PPO} as the RL algorithm. 
To thoroughly assess the ZSC ability, we evaluated the algorithms with unseen middle-level and expert partners. 
We use the human proxy model $H_{proxy}$ proposed by Carroll et al.~\citep{HARL} as middle-level partners and the models trained with baselines and \algo as expert partners. 
The mean of the rewards is recorded as the performance of each method in collaborating with expert teammates. 
In the case study, we analyze the learning process of \algo, which shows that our method overcomes cooperative incompatibility. 
Furthermore, we visualize the trajectories with different ratios and play with expert teammates to analyze how the ratios affect the learned strategies. 
Appendix~\ref{appendix:cole} and Appendix~\ref{appendix:base} give details of the implementation of \algo and baselines.

\vspace{-2mm}
\subsection{Combining Objectives' Effectiveness Evaluation}
\begin{figure}[t!]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/ablation_exp1.pdf}
    \caption{{The result of the combining objectives' effectiveness evaluation.}
    Mean episode rewards over 400-timestep trajectories for \algo with different objective ratios 0:4, 1:3, 2:2, and 3:1, paired with the unseen middle-level partner $H_{proxy}$.
    The gray bars in the background show the rewards of self-play.
    }
    \label{fig:exp_ablation}
\vspace{-2mm}
\end{figure}



\vspace{-1mm}
This section evaluates the effectiveness of different ratios between the two objectives, including 0:4, 1:3, 2:2, and 3:1.
We divided each training batch into four parts, the ratio indicating the proportion of data generated by self-play and data generated by playing with strategies from the cooperative incompatible distribution. 
We omitted the 4:0 ratio as it would result in the framework degenerating into self-play.
Fig.~\ref{fig:exp_ablation} shows the mean rewards of episodes over 400 time steps of gameplay when paired with the unseen middle-level partner $H_{proxy}$ \citep{HARL}. 
We found that \algo with ratios 0:4 and 1:3 achieved better performance than the other ratios. 
In particular, \algo, with a ratio of 1:3, outperformed the other methods in the Cramped Room, Coordination Ring, and Counter Circuit layouts. 
On the Forced Coordination layout, which is particularly challenging for cooperation due to the separated regions, all four ratios performed similarly on average across different starting positions.  
Interestingly, \algo with only the cooperative compatible objective (ratio 0:4) performed better on the Asymmetric Advantages and Forced Coordination layouts when paired with the middle-level partner.
We discuss this phenomenon further in Section \ref{different_levels}.
The effectiveness evaluations indicate that combining individual and cooperative compatible objectives is crucial to improving performance with unseen partners.
Overall, we select the ratio of 1:3 as the best choice.
\vspace{-2mm}
\subsection{Evaluation with Different Levels of Partners}
\label{different_levels}
\begin{figure}[t!]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/main_exp.pdf}
    \caption{{Performance with middle-level partners.}
    The performance of \algo with middle-level partners, reported as mean episode rewards over 400-timestep trajectories for objective ratios 0:4 and 1:3 when paired with the unseen middle-level partner $H_{proxy}$. The results include the mean and standard error over five different random seeds. The gray bars indicate the rewards obtained when playing with themselves; the hashed bars indicate the performance when starting positions are switched.
    }
    \label{fig:main_exp}
\vspace{-2mm}
\end{figure}

To thoroughly evaluate the zero-shot cooperative ability of all methods, we adopted two sets of evaluation protocols. 
The first protocol involves playing with a human proxy model $H_{proxy}$ trained with behavior cloning.
However, because the quality and quantity of the human data used for behavior cloning are limited, the capabilities of the human proxy model can only be classified as middle-level. 
Therefore, we use an additional evaluation protocol to coordinate with unseen expert partners. 
We selected the best models of our reproduced baselines and \algo 0:4 and 1:3 as expert partners.

Fig.~\ref{fig:main_exp} presents the performance of SP, PBT, MEP, and \algo with 0:4 and 1:3 when cooperating with middle-level partners. 
We observed that different starting positions on the left and right in asymmetric layouts resulted in significant performance differences for the baselines. 
For example, in the Asymmetric Advantages, the cumulative rewards of all baselines in the left position were nearly one-third of those in the right position. 
In contrast, \algo performed well in both the left and right positions.



As shown in Fig.~\ref{fig:main_exp}, \algo outperforms other methods in all five layouts when paired with the middle-level partner, i.e., the human proxy model. 
Interestingly, \algo 0:4 with only the cooperative compatible objective achieves better performance than \algo 1:3 on some layouts, such as Asymmetric Advantages. 
However, the self-play rewards of \algo 0:4 are much lower than those of \algo 1:3 and even those of the baselines. 
Furthermore, the performance of \algo 0:4 with unseen experts, as shown in Table~\ref{tab:exp_expert}, is sometimes lower than that of the baselines.
We visualize the trajectories in the evaluation at the expert level and provide further analysis to explain this situation in Appendix~\ref{Trajectory}.


\begin{table}[t!]
\caption{Performance with expert partners. Mean episode rewards over 1 min trajectories for baselines and \algo with ratios 0:4 and 1:3.
  Each column represents a different expert group, in which the result is the mean reward for each model playing with all others.
  }
\label{tab:exp_expert}
\begin{center}
\resizebox{\linewidth}{!}{%
\begin{sc}
    \begin{tabular}{lcccccc}
\toprule
\multirow{2}{*}{\textbf{Layout}} &\multirow{2}{*}{\textbf{Ratio}} & \multicolumn{4}{c}{\textbf{Baselines}}  &\multirow{2}{*}{\textbf{COLEs}}\\
\cline{3-6}\
 && \textbf{SP} & \textbf{PBT} & \textbf{FCP} & \textbf{MEP} &  \\
\midrule
\multirow{2}{*}{\textbf{Cramped Rm.}} &0:4&
153.00 & 198.50  & {199.83 } & 178.83 & 169.76 \\&1:3& 
165.67 & 209.83 & 207.17 & 196.83 & \textbf{212.80}\\
\hline
\multirow{2}{*}{\textbf{Asymm.Adv.}} &0:4&
108.17  & 164.83 & 175.50 & 179.83& \textbf{182.80}
\\&1:3&
108.17 & 161.50 & 172.17 & {179.83} & 178.80\\
 \hline
\multirow{2}{*}{\textbf{Coord. Ring}}&0:4&
132.00 & 106.83 & {142.67} & 130.67  & 118.08
\\&1:3&
133.33 & 158.83 & 144.00  & 124.67 &\textbf{166.32}\\
 \hline
\multirow{2}{*}{\textbf{Forced Coord.}} &0:4&
~~58.33 & ~~61.33 & ~~50.50  & ~~{79.33} &  ~~46.40\\&1:3&
~~61.50  & ~~70.33 & ~~62.33  & ~~38.00  &~~\textbf{86.40}\\
 \hline
\multirow{2}{*}{\textbf{Counter Circ.}}&0:4&
~~44.17  & ~~48.33 & ~~60.33& ~~21.33 & ~~{90.72}\\
&1:3&
~~65.67  & ~~64.00  & ~~46.50  & ~~76.67  &  \textbf{105.84}
\\
\bottomrule
\end{tabular}
\end{sc}
}
\end{center}
\vspace{-1mm}
\end{table}
Table~\ref{tab:exp_expert} presents the outcomes of each method when cooperating with expert partners. 
Each column in the table represents different expert groups, including four baselines and one \algo with a ratio of 0:4 or 1:3. 
The last column, labeled ``COLEs," represents the mean rewards of the corresponding \algo when working with other baselines. 
The table displays the mean cumulative rewards of each method when working with all other models in the expert group. 
The results indicate that \algo 1:3 outperforms the baselines and \algo 0:4, except in the layout of Asymmetric Advantages.
In the Asymmetric Advantages, \algo 0:4 only achieved a four-point victory over \algo 1:3, which can be considered insignificant considering the margin of error. 
In the other four layouts, the rewards obtained by \algo 1:3 while working with expert partners are significantly higher than those of \algo 0:4 and the baselines.

Our results suggest that \algo 1:3 has a stronger adaptive ability with different levels of partners. Furthermore, individual objectives are crucial in zero-shot coordination with expert partners. 
In conclusion, \algo 1:3 is more robust and flexible in real-world scenarios when working with partners of different levels.

\vspace{-2mm}
\subsection{Effectively Conquer Cooperative Incompatibility}
\label{casestudy} 
\begin{figure}[t!]
\centering    \includegraphics[width=0.77\linewidth]{figures/cole_analysis.pdf}
    \caption{
    {The learning process analysis of \algo 1:3.}
The darker-colored element on the left represents higher rewards, while the darker-colored element on the right represents lower centrality. The clustering of darker-colored areas around the diagonal on the right indicates that the new strategy adopted in each generation is preferred by most strategies, thus overcoming the cooperative incompatibility.
}
\label{fig:cole_analysis}
\vspace{-2mm}
\end{figure}
\vspace{-1mm}
In our analysis of the learning process of \algo 1:3 in the Overcooked environment, as shown in Fig.~\ref{fig:cole_analysis}, we observe that the method effectively overcomes the problem of cooperative incompatibility. 
The figure on the left in Fig.~\ref{fig:cole_analysis} shows the payoff matrix of 50 uniformly sampled checkpoints during training, with the upper left corner representing the starting point of training. 
Darker red elements in the payoff matrix indicate higher rewards. 
The figure on the right displays the centrality matrix of preferences, which is calculated by analyzing the learning process. 
Unlike the payoff matrix, the darker elements in the centrality matrix indicate lower values, indicating that more strategies prefer them in the population. 
As shown in the figure, the darker areas cluster around the diagonal of the preference centrality matrix, indicating that most of the others prefer the updated strategy of each generation. 
Thus, we can conclude that our proposed \algo effectively overcomes the problem of cooperative incompatibility.

\vspace{-3mm}
\section{Conclusion}
\vspace{-1mm}
In this paper, we propose graphic-form games and preference graphic-form games to intuitively reformulate cooperative games, which can efficiently evaluate and identify cooperative incompatibility. Furthermore, we develop empirical gamescapes for GFG to understand the objectives and present {COLE} framework\xspace to iteratively approximate the best response preferred by most others over the most recent population. 
Theoretically, we prove that {COLE} framework\xspace converges to a local best-preferred strategy. 
Furthermore, if we choose the in-degree centrality as the preference centrality function, the convergence rate would be Q-sublinear. 
Empirically, our experiments in the Overcooked environment show that our algorithm \algo outperforms state-of-the-art methods and efficiently overcomes cooperative incompatibility. We include limitations and future work in Appendix~\ref{appendix:limits}.

\clearpage


\subsection{Graph Theory}
\textbf{Graph:} A weighted digraph $G$ can be represented as $(V, E, w)$, where $V$ is a set of vertices, $E$ is a set of edges given as ordered pairs $(i, j)$ of nodes in $V$, and $w: E\rightarrow \mathbb{R}$ is a weight function that maps edges to real numbers.
\ 
If 1) $(u,v)\in E$ implies $(v,u)\in E$ and 2) $w((u,v)) = w((v,u))$, then $G$ is a weighted undirected graph.
\ 
Besides, for all $e\in E$, if we have $w(e)=1$, the graph $G$ is said to be unweighted.
\\
\textbf{Centrality:} Centrality is a fundamental concept in graph theory that measures the importance, influence, or priority of a node in the graph. Degree centrality is one of the best-known measures of centrality~\citep{Linton1978Centrality}, which ranks nodes by the number of neighbours they have. 
\ 
Specifically, in-degree centrality refers to the number of in-linked neighbours incident upon a given node, and out-degree centrality is the number of out-linked neighbours.
\\
\textbf{Weighted PageRank:} PageRank is an algorithm to measure the importance of the pages in Web Structure Mining.
\ 
As a development of PageRank, weighted PageRank assigns larger rank values according to their popularity~\citep{Xing2004Weighted}.
\
The weighted PageRank of node $u\in {\mathcal{N}}$ is given as
\begin{equation}
    \operatorname{PR}(u) = (1-d) + d\sum_{v\in B(u)} \operatorname{PR}(v) W^{in}_{(v,u)} W^{out}_{(v, u)},
\end{equation}
where $B(u)$ denotes the set of referrer nodes of node $u$.
\ 
$W^{in}_{(v,u)}, W^{out}_{(v, u)}$ evaluate popularity from the number of inlinks and outlinks, respectively. 
\begin{equation}
\begin{aligned}
    W^{in}_{(v,u)} &= \frac{I_u}{\sum_{p\in R(v)} I_p},
    \\
    W^{out}_{(v,u)} &= \frac{O_u}{\sum_{p\in R(v)} O_p},
\end{aligned}
\end{equation}
where $I$ and $O$ are the in-degree and out-degree of a node, and $R(v)$ denotes the list of nodes that node $v$ references.

\textbf{Graphic-form Game:}
An $n$-player graphical game~\citep{Micharl2013Graphicgame} is a tuple $(G, \{{\mathcal{S}}_i\}_{i\in N}, \{w_i\}_{i\in N})$, where $G$ is an undirected graph on $n$ vertices ($N$) and player $i$ is denoted by the vertex labelled $i$ in $G$. ${\mathcal{S}}_i$ is the strategy space of player $i$, and $w_i: \prod_{j\in N_G(i)} {\mathcal{S}}_j \rightarrow \mathbb{R}$ is the cost function, where $N_G(i)\subseteq \{1,\cdots,n\}$ represents the set of neighbors of player $i$ in $G$. 
By convention, $i$ itself is always in $N_G(i)$. 


The exact computation of the Shapley value requires enumerating all permutations of the grand coalition ${\mathcal{N}}$, which has exponential time complexity. 
\ 
Therefore, Monte Carlo permutation sampling is proposed to approximate the Shapley value in linear time~\citep{Castro2009PolynomialCO}. 
\ 
The detailed pseudocode is presented in Appendix ***.

\clearpage
