\documentclass{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{bm}
\usepackage{amssymb,amsmath,amsthm}

\newtheorem{theorem}{Theorem}
\renewcommand\thesection{\Alph{section}}

%added by zhangzq
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage[switch]{lineno}
\usepackage{diagbox}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
% \usepackage[pdftex,linkcolor=blue,citecolor=blue,backref=page]{hyperref}


%added by zhangzq
\usepackage{comment}
\usepackage{subfigure}
%\usepackage[table]{xcolor}
\usepackage{xcolor}
\definecolor{lightgray}{gray}{0.893}
\usepackage{colortbl}
%\usepackage{subcaption}

\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Fast Teammate Adaptation in the Presence of Sudden Policy Change\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Harry~Q.~Bovik}
\author[1,2]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[1]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    Cranberry University\\
    Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Second Affiliation\\
    Address\\
    …
}
\affil[3]{%
    Another Affiliation\\
    Address\\
    …
  }
  
  \begin{document}
  
\onecolumn
\maketitle

\section{Related Work} 
\paragraph{Cooperative Multi-agent Reinforcement Learning}
Many real-world problems involve multiple interacting agents and can usually be modeled as multi-agent systems~\citep{DBLP:journals/access/DorriKJ18}. Among the many solutions, Multi-Agent Reinforcement Learning (MARL)~\citep{zhang2021multi} has achieved great success, profiting from the powerful problem-solving ability of deep reinforcement learning~\citep{Wang2020DeepRL}. Further, when the agents share a common goal, the problem is referred to as cooperative MARL~\citep{oroojlooy2022review}, which has shown great progress in diverse domains such as path finding~\citep{sartoretti2019primal}, active voltage control~\citep{DBLP:conf/nips/WangXGSG21}, and dynamic algorithm configuration~\citep{xue2022multiagent}. Many methods have been proposed to facilitate coordination among agents, including policy-based ones (e.g., MADDPG~\citep{maddpg}, MAPPO~\citep{mappo}), value-based ones such as VDN~\citep{vdn} and QMIX~\citep{qmix}, and other techniques such as transformers~\citep{wen2022multiagent} and many variants~\citep{gorsane2022towards}, demonstrating remarkable coordination ability in a wide range of tasks such as
SMAC~\citep{pymarl}, Hanabi~\citep{mappo}, and GRF~\citep{wen2022multiagent}.
Besides the mentioned approaches and their variants, many other methods investigate cooperative MARL from other aspects, including causal inference among agents~\citep{grimbly2021causal}, policy deployment in an offline way for real-world applications~\citep{DBLP:conf/nips/YangMLZZHYZ21},
 communication~\citep{zhu2022survey} for partial observability, model learning for improved sample efficiency~\citep{wang2022model}, policy robustness when perturbations occur~\citep{guo2022towards}, training paradigms like CTDE (centralized training with decentralized execution)~\citep{DBLP:conf/atal/LyuXDA21}, testbed design for continual coordination validation~\citep{DBLP:conf/icml/NekoeiBCC21}, and ad hoc teamwork~\citep{mirsky2022survey}.


\textbf{Non-stationarity} is a longstanding topic in single-agent reinforcement learning (SARL)~\citep{Padakandla2019ReinforcementLA,Padakandla2020ASO}, where the environment dynamics (e.g., transition and reward functions) of a learning system may change over time. For SARL, most existing works focus on inter-episode non-stationarity, where decision
processes are non-stationary across episodes, including the multi-task setting~\citep{VithayathilVarghese2020ASO}, continual reinforcement learning~\citep{DBLP:journals/jair/KhetarpalRRP22}, and meta reinforcement learning~\citep{beck2023survey}. These problems can be formulated as a contextual MDP~\citep{hallak2015contextual} and can be solved by techniques such as task embedding learning. Other works consider intra-episode non-stationarity, where an agent may suffer from dynamics drift within a single episode~\citep{DBLP:conf/rss/KumarFPM21,DBLP:conf/iclr/RenSJSWB22,chen2022an,escp,DBLP:conf/case/DastiderL22,feng2022factored}. Specifically, HDP-C-MDP~\citep{DBLP:conf/iclr/RenSJSWB22} assumes the latent context to be finite and Markovian, and adopts a sticky Hierarchical Dirichlet Process (HDP) prior for
model learning; FANS-RL~\citep{feng2022factored} assumes the latent context is Markovian and the environment can be modeled as a factored MDP; ESCP~\citep{escp} considers the sudden changes an agent may encounter and obtains a robust policy by learning an auxiliary context recognition model, and its experiments show that, in environments with both in-distribution and out-of-distribution
parameter changes, ESCP can not only better recover the environment encoding but also adapt more rapidly to the post-change environment;
SeCBAD~\citep{chen2022an} further assumes that the environment context usually stays stable for a stochastic
period and then changes in an abrupt and unpredictable manner.

Differing from the SARL setting, non-stationarity is an inherent challenge for MARL, as an agent's policy may become unstable due to the concurrent learning of the other agents' policies~\citep{papoudakis2019dealing}. Previous works mainly focus on addressing non-stationarity in the training phase, using techniques such as modeling others~\citep{DBLP:journals/ai/AlbrechtS18}, meta policy adaptation~\citep{DBLP:conf/icml/Kim0RSAHLTH21}, and experience sharing~\citep{DBLP:conf/nips/ChristianosSA20}.
Other works concentrate on non-stationarity across episodes, such as multi-task training~\citep{qin2022multi}, adversarial training~\citep{DBLP:conf/atal/Xue0AROY22,sun2023certifiably}, and training policies for zero-shot coordination~\citep{DBLP:conf/icml/HuLPF20}. Some works also investigate the policy robustness of a multi-agent system under different kinds of perturbations, including uncertainty in local observations~\citep{lin2020robustness}, model functions~\citep{zhang2020robustmarl}, action making~\citep{hu2020robust}, and message sending~\citep{DBLP:conf/atal/Xue0AROY22,sun2023certifiably}.
Although these approaches have made progress, the non-stationarity caused by sudden changes in teammates' policies remains largely unexplored despite its practical urgency.


\section{Details about Derivation}

\subsection{Details about CRP and derivation of cluster assignment}
The Chinese restaurant process (CRP)~\citep{crp} is a discrete-time stochastic process that defines a prior distribution over cluster structures, and it can be described simply as follows. When a customer comes into a Chinese restaurant, the customer either sits down alone at a new table with a probability proportional to a concentration parameter $\alpha$, or sits with other customers at an occupied table with a probability proportional to the number of customers already sitting at that table. Customers sitting at the same table are assigned to the same cluster. Concretely, suppose that $K$ customers currently sit in the restaurant. Let $z_i$ be an indicator variable that tells at which table the $i^{\text{th}}$ customer sits, let $n_m$ denote the number of customers sitting at the $m^{\text{th}}$ table, and let $M$ be the total number of non-empty tables. Note that $\sum_{m=1}^M n_m=K$. The probability that the $(K+1)^{\text{th}}$ customer sits at the $m^{\text{th}}$ table is:
\begin{equation}
    \begin{aligned}
        P(z_{K+1}=m|\alpha)=\frac{n_m}{K+\alpha}, \quad m=1,...,M.
    \end{aligned}
\end{equation}
 There is some probability that the customer decides to sit at a new table and if the label of the new table is $M+1$, then:
\begin{equation}
    \begin{aligned}
        P(z_{K+1}=M+1|\alpha)=\frac{\alpha}{K+\alpha}.
    \end{aligned}
\end{equation}
Taken together, the two equations characterize the CRP.
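
As a small numerical illustration (the values are hypothetical and chosen only for exposition), suppose $K=5$ customers occupy $M=2$ tables with $n_1=3$ and $n_2=2$, and let $\alpha=1$. The two equations above then give
\begin{equation*}
    P(z_{6}=1|\alpha)=\frac{3}{6}=0.5,\quad P(z_{6}=2|\alpha)=\frac{2}{6}\approx 0.333,\quad P(z_{6}=3|\alpha)=\frac{1}{6}\approx 0.167,
\end{equation*}
which sum to one. A larger $\alpha$ shifts probability mass toward opening a new table: with $\alpha=4$ and the same seating, the new-table probability becomes $4/9\approx 0.444$.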

 The cluster assignment of the $k^{\text{th}}$ generated teammate group $P(v_k^{(m)}|\tau_k^S,\tau_k^A)$ can be decomposed:
 \begin{equation}
     \begin{aligned}
         &P(v_k^{(m)}|\tau_k^S,\tau_k^A)\\ 
         &\quad=\frac{P(v_k^{(m)}, \tau_k^S, \tau_k^A)}{P(\tau_k^S, \tau_k^A)}\\
         &\quad=\frac{P(\tau_k^S,\tau_k^A|v_k^{(m)})P(v_k^{(m)})}{P(\tau_k^S,\tau_k^A)}\\
         &\quad=\frac{P(\tau_k^A|\tau_k^S, v_k^{(m)})P(\tau_k^S|v_k^{(m)})P(v_k^{(m)})}{P(\tau_k^S,\tau_k^A)}\\
         &\quad\propto P(\tau_k^A|\tau_k^S, v_k^{(m)})P(\tau_k^S|v_k^{(m)})P(v_k^{(m)}).
     \end{aligned}
 \end{equation}
 Since $\tau_k^S$ is a set of states that is not determined by the behavioral type of the teammates if the correlation along the time dimension is neglected, $P(\tau_k^S|v_k^{(m)})$ can be treated as a constant. Accordingly, we derive $P(v_k^{(m)}|\tau^S_k, \tau^A_k)\propto P(v_k^{(m)})P(\tau_k^A|\tau_k^S; v_k^{(m)})$.
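
 To illustrate how the prior and the predictive likelihood interact, consider a hypothetical case with $M=2$ existing clusters, CRP prior values $\big(P(v_k^{(1)}), P(v_k^{(2)}), P(v_k^{(3)})\big)=(1/2, 1/3, 1/6)$, and assumed predictive likelihoods $(0.02, 0.10, 0.05)$ for the three candidate assignments (all numbers are made up for illustration). The unnormalized posteriors are
 \begin{equation*}
     \tfrac{1}{2}\cdot 0.02=0.0100,\qquad \tfrac{1}{3}\cdot 0.10\approx 0.0333,\qquad \tfrac{1}{6}\cdot 0.05\approx 0.0083,
 \end{equation*}
 and normalizing by their sum gives approximately $(0.194, 0.645, 0.161)$. In this example the teammate group would be assigned to the second cluster, even though the first cluster has the largest prior, because the likelihood term dominates.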
 
\subsection{The full derivation of $\mathcal{L}_{\text{GCE}}$}
To guide the context encoder to identify and track the sudden change rapidly, ESCP~\citep{escp} proposes the following optimization objective:
\begin{equation}
    \begin{aligned}
        \mathcal{L}_{\text{GCE}} = \sum_{m=1}^M\left(\mathbb{E}[||z^m_t-\mathbb{E}[z^m_t]||_2^2] + ||\mathbb{E}[z^m_t] - u^m||_2^2\right),
    \end{aligned}
    \label{obj_escp}
\end{equation}
where $z_t^m$ is the representation produced by the context encoder in the $m^{\text{th}}$ environment, $u^m$ is the oracle latent context vector, and $M$ is the number of environments. In our setting, $z_t^m$ is the latent context vector when paired with teammates belonging to the $m^{\text{th}}$ cluster, $u^m$ is the oracle behavior type, and $M$ is the number of clusters.

 Since we have no access to the oracle $u^m$, a set of surrogates that possesses large diversity is required to be separable and representative. Meanwhile, $u^m$ is an intermediate variable used to guide $\mathbb{E}[z_t^m]$, so we could directly maximize the diversity of $\{\mathbb{E}[z_t^m]\}_{m=1}^M$ by maximizing the determinant of a relational matrix $R_{\{\mathbb{E}[z_t^m]\}}$. Each element of the relational matrix is:
\begin{equation}
    \begin{aligned}
            R_{\{\mathbb{E}[z_t^m]\}}(i, j) = \exp(-\kappa{||\mathbb{E}[z_t^i]-\mathbb{E}[z_t^j]||_2^2}),
    \end{aligned}
\end{equation}
where $\kappa$ is the radius hyperparameter of the radial basis function used to measure the distance between two vectors. The objective function can now be written as:
\begin{equation}
    \begin{aligned}
        \mathcal{L}_{\text{GCE}} = \sum_{m=1}^M\mathbb{E}[||z^m_t-\mathbb{E}[z^m_t]||_2^2]-\log\det(R_{\{\mathbb{E}[z_t^m]\}}).
    \end{aligned}
\end{equation}
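
For intuition, consider the simplest case $M=2$ (a sketch for illustration only). Writing $d=||\mathbb{E}[z_t^1]-\mathbb{E}[z_t^2]||_2$, the relational matrix and its determinant are
\begin{equation*}
    R_{\{\mathbb{E}[z_t^m]\}}=\begin{pmatrix} 1 & e^{-\kappa d^2}\\ e^{-\kappa d^2} & 1\end{pmatrix},\qquad \det(R_{\{\mathbb{E}[z_t^m]\}})=1-e^{-2\kappa d^2},
\end{equation*}
so $-\log\det(R_{\{\mathbb{E}[z_t^m]\}})$ decreases monotonically from $+\infty$ at $d=0$ toward $0$ as $d$ grows. Minimizing this term therefore pushes the two mean context vectors apart, while the first term of the objective keeps each $z_t^m$ close to its own mean.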

To stabilize the training process, ESCP substitutes $\mathbb{E}[z_t^m]$ with $\bar z^m$, which is the moving average of all past context vectors. $\{\bar z^m\}$ will be updated after sampling a new batch of $z_t^m$: 
\begin{equation}
    \begin{aligned}
        \bar z^m=\eta \text{sg}(\bar z^m)+(1-\eta)\mathbb E[z_t^m],
    \end{aligned}
\end{equation}
where $\text{sg}(\cdot)$ denotes stopping gradient, and $\eta$ is a hyperparameter controlling the moving average horizon.
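
Unrolling this update makes the role of $\eta$ explicit. As a sketch, let $\bar z^m_{(n)}$ denote the moving average after the $n^{\text{th}}$ update and $\mu^m_{(j)}$ the batch estimate of $\mathbb{E}[z_t^m]$ used in the $j^{\text{th}}$ update (both subscripts are notation introduced here for illustration). Then
\begin{equation*}
    \bar z^m_{(n)}=\eta^n\,\bar z^m_{(0)}+(1-\eta)\sum_{j=1}^{n}\eta^{\,n-j}\mu^m_{(j)},
\end{equation*}
so a batch that is $k$ updates old is weighted by $(1-\eta)\eta^k$, and the effective averaging horizon is roughly $1/(1-\eta)$ updates, e.g., about $10$ updates for a hypothetical value $\eta=0.9$.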

\subsection{Variational Bound of Teammates Context Approximation}
\label{mi}

In order to make context vector $e_t^{m, i}$ generated by local trajectory encoder $f_{\phi_i}$ informatively consistent with global context $z_t^m$ encoded by $g_{\theta}$, we propose to maximize the mutual information between $e_t^{m, i}$ and $z_t^m$ conditioned on the agent $i$'s local trajectory $\tau_t^{m, i}$. We draw the idea from variational inference~\citep{DBLP:conf/iclr/AlemiFD017} and derive a lower bound of this mutual information term.

\begin{theorem} 
Let $\mathcal{I}(e_{t}^{m, i};z_t^m
|\tau_t^{m,i})$ be the mutual information between the local context $e_t^{m, i}$ of agent $i$ and global context $z_t^m$ conditioned on agent $i$'s local trajectory $\tau^{m, i}_t$. The lower bound is given by
\begin{equation}
    \mathbb{E}_{\mathcal{D}}[\log q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)]+\mathcal{H}(e_{t}^{m, i}|\tau^{m, i}_t).
\end{equation}
\end{theorem}
Here $m$ is the cluster id of the teammates cooperating with controlled agents to finish the task in this episode.

\begin{proof}
By a variational distribution $q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)$ parameterized by $\xi$, we have
\begin{equation}
    \begin{aligned}
        &\mathcal{I}(e_{t}^{m, i};z_t^m
|\tau_t^{m,i})  \\
        =& \mathbb{E}_{\mathcal{D}}\left[ \log \frac{p(e_{t}^{m, i},z_t^m|\tau_t^{m,i})}{p(e_{t}^{m,i}|\tau_t^{m, i})p(z_t^m|\tau_t^{m, i})}\right] \\
        =&\mathbb{E}_{\mathcal{D}}\left[\log \frac{p(e_{t}^{m, i}|z_t^m, \tau_t^{m,i})}{p(e_{t}^{m,i}|\tau_t^{m, i})}\right] \\
        =&\mathbb{E}_{\mathcal{D}}\left[\log \frac{q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)}{p(e_{t}^{m,i}|\tau_t^{m, i})}\right]+ \\
        &\mathbb{E}_{\mathcal{D}}\left[D_{\text{KL}}(p(e_t^{m, i}|z_t^m, \tau_t^{m,i}) || q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t))\right] \\
        \geq & \mathbb{E}_{\mathcal{D}}\left[\log \frac{q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)}{p(e_{t}^{m,i}|\tau_t^{m, i})}\right] \\
        = & \mathbb{E}_{\mathcal{D}}[\log q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)]+\mathcal{H}(e_{t}^{m, i}|\tau^{m, i}_t).
    \end{aligned}
\end{equation}
\end{proof}
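
For completeness, the step that introduces $q_\xi$ in the proof amounts to multiplying and dividing by $q_\xi$ inside the logarithm (a routine identity, spelled out here for the reader's convenience):
\begin{equation*}
    \log \frac{p(e_{t}^{m, i}|z_t^m, \tau_t^{m,i})}{p(e_{t}^{m,i}|\tau_t^{m, i})}
    =\log \frac{q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)}{p(e_{t}^{m,i}|\tau_t^{m, i})}
    +\log \frac{p(e_{t}^{m, i}|z_t^m, \tau_t^{m,i})}{q_{\xi}(e_{t}^{m, i}|z_t^m, \tau^{m, i}_t)}.
\end{equation*}
Taking the expectation of the last term over $e_t^{m,i}\sim p(\cdot|z_t^m,\tau_t^{m,i})$ yields the KL divergence, which is non-negative, so dropping it gives the inequality. The bound is tight when $q_\xi$ matches the true conditional distribution.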

\section{Details About Baselines and Benchmarks}
\subsection{Baselines Used}
\paragraph{QMIX~\citep{qmix}:} As we investigate the integrative abilities of Fastap in the manuscript, here we introduce the value-based method QMIX~\citep{qmix} used in this paper. Our proposed framework Fastap follows the \textit{Centralized Training with Decentralized Execution} (CTDE) paradigm used in value-based MARL methods, as well as the Individual-Global-Max (IGM)~\citep{QTRAN} principle, which asserts the consistency between joint and local greedy action selections by the joint value function $Q_{\rm tot}(\boldsymbol{\tau}, \boldsymbol{a})$ and individual value functions $\left[Q_i(\tau^i, a^i)\right]_{i=1}^n$:
\begin{equation}
\begin{aligned}
  & \forall \boldsymbol{\tau} \in \boldsymbol{\mathcal{T}}, \underset{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}}{\arg \max } Q_{\rm tot}(\boldsymbol{\tau}, \boldsymbol{a})= \\
     &\left(\underset{a^{1} \in \mathcal{A}}{\arg \max } Q_{1}\left(\tau^{1}, a^{1}\right), \ldots, \underset{a^{n} \in \mathcal{A}}{\arg \max } Q_{n}\left(\tau^{n}, a^{n}\right)\right).
     \end{aligned}
\end{equation}


\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{Figures/qmix.pdf}
\caption{The overall structure of QMIX. (a) The detailed structure of the mixing network, whose weights and biases are generated from a hyper-net (red) which takes the global state as the input. (b) QMIX is composed of a mixing network and several agent networks. (c) The detailed structure of the individual agent network. }
\label{fig:QMIX}
\end{figure}
QMIX extends VDN by factorizing the global value function $Q_{\rm tot}^{\mathrm{QMIX}}(\boldsymbol{\tau}, \boldsymbol{a})$ as a monotonic combination of the agents' local value functions $\left[Q_i(\tau^i, a^i)\right]_{i=1}^n$:
\begin{equation}
     \forall i \in \mathcal{N}, \frac{\partial Q_{\rm tot}^{\mathrm{QMIX}}(\boldsymbol{\tau}, \boldsymbol{a})}{\partial Q_{i}\left(\tau^{i}, a^{i}\right)}\geq 0.
\end{equation}

We mainly implement Fastap on QMIX for its proven performance in various papers, and its overall structure is shown in Fig.~\ref{fig:QMIX}. QMIX uses a hyper-net conditioned on the global state to generate the weights and biases of the mixing network that combines the local Q-values, and applies the absolute value operation to the weights to keep them non-negative, which guarantees monotonicity.
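
As a hedged sketch of this construction (the symbols $W_1, W_2, b_1, b_2$ and the hyper-net names $h_{W_1}, h_{W_2}, h_{b_1}, h_{b_2}$ are introduced here for illustration and follow the commonly used two-layer QMIX mixer, not necessarily our exact configuration), let $\boldsymbol{Q}=[Q_1(\tau^1,a^1),\ldots,Q_n(\tau^n,a^n)]^\top$ collect the local Q-values. The mixing network then computes
\begin{equation*}
    Q_{\rm tot}=W_2^\top\,\mathrm{ELU}\!\left(W_1^\top \boldsymbol{Q}+b_1\right)+b_2,\qquad W_1=|h_{W_1}(s)|,\quad W_2=|h_{W_2}(s)|,\quad b_1=h_{b_1}(s),\quad b_2=h_{b_2}(s).
\end{equation*}
Because $W_1$ and $W_2$ are element-wise non-negative and the ELU activation is monotonically increasing, the partial derivative of $Q_{\rm tot}$ with respect to every $Q_i$ is non-negative, while the biases remain unconstrained.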


\paragraph{PEARL~\citep{PEARL}:} This baseline comes from single-agent and meta-learning settings. It aims to represent the environments according to some hidden representations. Concretely, PEARL utilizes the transition data as context to infer the feature of the environment, which is modeled by a product of Gaussians. When it is applied to MARL tasks, the PEARL module is adopted and optimized for local context encoders of each individual controllable agent. 

\paragraph{ESCP~\citep{escp}:} ESCP is a single-agent reinforcement learning algorithm that aims to recognize and adapt to new environments rapidly when encountering a sudden change in the environment. Its optimization objective in Eqn.~\ref{obj_escp} is applied to optimize a context encoder. To fit the framework and the specific tasks in MARL, the history is not truncated, and each controllable agent is equipped with a local encoder.

\paragraph{LIAM~\citep{LIAM}:}
LIAM equips each agent with an encoder-decoder structure to predict other agents' observations $\boldsymbol o^{-1}_t$ and actions $\boldsymbol a^{-1}_t$ at the current timestep based on its own local observation history $\tau_t=\{o_{0:t}\}$. The encoder and decoder are optimized to minimize the mean squared error of the observations plus the cross-entropy error of the actions. To fit the MARL setting in our work, the local context encoders of the controllable agents are asked to predict the teammates' observations and actions based on their local trajectories. The mean of their losses is used to optimize the encoders.

\paragraph{ODITS~\citep{ODITS}:}
Unlike the previous two methods that predict the actual behaviors of teammate agents, ODITS improves zero-shot coordination performance in an end-to-end fashion. Two variational encoders are adopted to improve the coordination capability. The global encoder takes the global state trajectory as input and outputs a Gaussian distribution. A vector $z$ is sampled and fed into a hyper-network that maps the ad hoc agent's local utility $Q_i$ into the global utility $Q_{\rm tot}$ to approximate the global discounted return. The local encoder has a similar structure, and the sampled $e$ is fed into the ad hoc agent's policy network. The encoders are updated by maximizing the return, together with the mutual information of the two context vectors conditioned on the local transition data, in an end-to-end manner. As ODITS considers only a single ad hoc agent, we equip each controllable agent with a local trajectory encoder and maximize the mean of the mutual information loss to fit our MARL setting.


\subsection{Relevant Environments}
\paragraph{Level-Based Foraging (LBF)~\citep{lbf}:}
LBF is a mixed cooperative-competitive partially observable grid-world game that requires highly coordinated agents to complete the task of collecting food. The agents and the foods are assigned random levels and positions at the beginning of an episode. The action space of each agent consists of movement in four directions, loading the food next to it, and a ``no-op'' action, while the foods are immobile during an entire episode. A group of agents can collect a food if the sum of their levels is no less than the level of the food, and they receive a normalized reward correlated with the level of the food. The main goal of the agents is to maximize the global return by cooperating with each other to collect the foods in a limited time. 

To test the performance of different algorithms in this setting, we consider a scenario with at most four agents with different levels and three foods with minimum levels $l\geq \sum_{i=1}^3 \mathrm{sorted}(\mathrm{levels})[i]$ in a $6\times 6$ grid world. Agents have a limited vision range of $1$ (the $3\times 3$ grid around the agent), and each episode has a limited horizon of $25$. In our Open Dec-POMDP setting, two agents are controllable and stay in the environment for the whole episode. The number of teammates might be $1$ or $2$, and their policy network will change as well. The reward that the agents receive is the level of the food they collect divided by the sum of all the food levels, as follows: 
    \begin{equation}
    \label{rewardlbf}
      \begin{split}
        r^i = \frac {\rm Food\_with\_Level\_i} { \sum_j \rm Food\_with\_Level\_j}.
        \end{split}
    \end{equation}
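
    As a worked instance of Eqn.~\ref{rewardlbf} (with hypothetical food levels chosen for illustration), suppose the three foods have levels $1$, $2$, and $3$, so the normalizer is $1+2+3=6$. Then
    \begin{equation*}
        r^{1}=\tfrac{1}{6}\approx 0.167,\qquad r^{2}=\tfrac{2}{6}\approx 0.333,\qquad r^{3}=\tfrac{3}{6}=0.5,
    \end{equation*}
    and collecting all three foods within the horizon yields a total return of $1$.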

\paragraph{Predator-prey (PP)~\citep{maddpg}:} This is a predator-prey environment. Good agents (prey) are faster and receive a negative reward for being hit by adversaries (predators) ($-10$ for each collision). Predators are slower and are rewarded for hitting good agents ($+10$ for each collision). Obstacles block the way. By default, there are 1 prey, 3 predators, and 2 obstacles. In our Open Dec-POMDP setting, two predators are controllable and stay in the environment for the whole episode. The other predator is the uncontrollable teammate whose policy changes suddenly.

\paragraph{Cooperative navigation (CN)~\citep{maddpg}:} In this task, four agents are trained to move to four landmarks while avoiding collisions with each other. All agents receive their velocity, position, and relative position to all other agents and landmarks. The action space of each agent contains five discrete movement actions. Agents are rewarded with the sum of negative minimum distances from each landmark to any agent, and an additional term is added to punish collisions among agents. In our Open Dec-POMDP setting, two agents are controllable and will stay in the environment for the whole episode. The number of teammates might be $1$ or $2$, and the policy network will change as well.


\paragraph{StarCraft II Micromanagement Benchmark (SMAC)~\citep{pymarl}:} SMAC is a combat scenario of StarCraft II unit micromanagement tasks. 
We consider a partial observation setting, where an agent can only see a circular area around it with a radius equal to the sight range, which is set to $9$. We train the ally units with reinforcement learning algorithms to beat enemy units controlled by the built-in AI. At the beginning of each episode, allies and enemies are generated at specific regions on the map. At each timestep, every agent takes an action from a discrete action space consisting of no-op, move [direction], attack [enemy id], and stop. Under the control of these actions, agents can move and attack in continuous maps. MARL agents receive a global reward equal to the total damage done to enemy units at each timestep. Killing each enemy unit and winning the combat (killing all the enemies) bring additional bonuses of $10$ and $200$, respectively. Here we create a map named 10m\_vs\_11m, where 10 allies and 14 enemies are each divided into 2 groups that are spawned at different points, and the allies need to gather together and launch attacks on the same group of enemies to win this task. Specifically, we control 7 allies to cooperate with 3 other teammates to finish the task, where the number of teammates remains fixed during an episode.


\section{The Architecture, Infrastructure, and Hyperparameters Choices of Fastap}
Since Fastap is built on top of QMIX in the main experiments, we here present detailed descriptions of specific settings in this section, including network architecture, the overall flow, and the selected hyperparameters for different environments.
\subsection{Network Architecture}
In this section, we give details about the following networks: (1) the encoder $E_{\omega_1}$ and decoder $D_{\omega_2}$ in the CRP process, (2) the trajectory encoders $g_\theta$ and $f_{\phi_i}$ and the agent networks, and (3) the variational distribution $q_\xi$ and the teammate modeling decoder $h_{\psi_i}$.

The 8-layer transformer encoder $E_{\omega_1}$ takes global trajectory $\tau=(s_0, \boldsymbol{a}_0,..., s_T)$ as inputs and outputs 16-dimensional behavioral embeddings $v$. The RNN-based decoder $D_{\omega_2}$, consisting of a GRU cell whose hidden dimension is 16, takes $\tau_t^X=(s_0,..., s_t)$ and $v$ as input and reconstructs the action $\boldsymbol{a}_t$.

The global and local trajectory encoders $g_\theta$ and $f_{\phi_i}$ are each designed as a 2-layer MLP with a GRU, and the hidden dimension is 64. A linear layer then transforms the embeddings into the mean values and standard deviations of a Gaussian distribution, from which the context vector is sampled. The global context $z_t$ and the state $s_t$ are concatenated and fed into the hypernetwork. The local context $e_t^i$, together with the local trajectory $\tau_t^i$, is fed into agent $i$'s individual Q network, which has a GRU cell with a dimension of 64 to encode historical information and two fully connected layers, to compute the local Q values $Q^i(\tau_t^i, e_t^i, \cdot)$. Finally, the local Q values are fed into the mixing network to calculate the TD loss.
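
One common way to draw such a sample while keeping the encoder differentiable is the reparameterization trick. The following is a sketch under that assumption (the text above only states that the context vector is sampled from the Gaussian, and the symbols $\mu_\theta$, $\sigma_\theta$, and $\epsilon$ are introduced here for illustration):
\begin{equation*}
    z_t=\mu_\theta(\tau_t)+\sigma_\theta(\tau_t)\odot\epsilon,\qquad \epsilon\sim\mathcal{N}(\boldsymbol{0}, I),
\end{equation*}
where $\mu_\theta(\tau_t)$ and $\sigma_\theta(\tau_t)$ are the outputs of the final linear layer and $\odot$ denotes the element-wise product. The local context $e_t^i$ can be sampled analogously from the output of $f_{\phi_i}$.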

To maximize the mutual information between local and global context vectors conditioned on the agent $i$'s local trajectory, a variational distribution network $q_\xi$ is used to approximate the conditional distribution. Concretely, $q_\xi$ is a 3-layer MLP with a hidden dimension of 64, and it outputs a Gaussian distribution where the predicted local context vector will be sampled. The agent modeling decoder $h_{\psi_i}$ is divided into two components including $h_{\psi_i}^o$ and $h_{\psi_i}^a$, where each one is a 3-layer MLP. Mean squared loss and maximum likelihood loss are calculated to optimize the objective, respectively.

\subsection{The Overall Flow of Fastap}
To illustrate the overall flow of Fastap, we first show the CRP-based infinite mixture procedure in Alg.~\ref{alg1}. A teammate group can be generated via any MARL algorithm, and we store a small batch of its trajectories in a replay buffer $\mathcal{D}_k$ (Lines 2--3). The encoder and decoder are trained to force the learned representation to precisely capture the behavioral information and precisely estimate the predictive likelihood (Line 4). Afterward, the CRP prior and the predictive likelihood are calculated to determine the assignment $m^*$ of the newly generated teammate group (Lines 5--7). Then, we update the existing cluster or instantiate a new cluster based on the assignment (Lines 8--17).

The training process of Fastap is shown in Alg.~\ref{alg2}. During the trajectory sampling stage, we first sample a teammate group from a cluster and fix it for the episode. The teammate group is paired with the controllable agents, and they make decisions together (Lines 3--12). To train the agent policy networks and the context encoders, the moving average values of the context vectors are updated and the optimization objectives are calculated (Lines 14--22). Besides, we present the testing process in Alg.~\ref{alg3}, where teammates might change suddenly. A sudden change distribution $\mathcal{U}$ controls the waiting time that determines the changing frequency (Lines 5--12).
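
As a concrete trace of the waiting-time mechanism in Alg.~\ref{alg3} (with a hypothetical draw, and assuming the teammates act after the re-sampling step within each timestep), suppose $u_0=3$ is sampled at $t=0$. The waiting time is then decremented as
\begin{equation*}
    u_1=u_0-1=2,\qquad u_2=u_1-1=1,\qquad u_3=u_2-1=0,
\end{equation*}
and since $u_3\leq 0$, both a new waiting time and a new teammate policy are re-sampled at $t=3$. The initial teammates thus act at timesteps $t=0,1,2$ before the sudden change occurs.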

\begin{algorithm}[!ht]
    \caption{Fastap: CRP-based infinite mixture procedure}
    \label{alg1}
    \textbf{Input}: concentration parameter $\alpha$, number of teammate groups generated in one iteration $L$, number of teammate groups generated so far $K$, number of clusters instantiated so far $M$, encoder $E_{\omega_1}$, decoder $D_{\omega_2}$.
    
    \begin{algorithmic}[1] 
        \FOR{$k=K+1,..,K+L$}
            \STATE Generate the $k^{\text{th}}$ teammate group.
            \STATE Sample small batch of trajectories $\tau_k$ of the $k^{\text{th}}$ teammate group and store them into $\mathcal{D}_k$.
            \STATE Train $E_{\omega_1}$ and $D_{\omega_2}$ according to $\mathcal{L}_{\text{model}}$ in Eqn.~4. 
            \STATE Calculate the CRP prior $P(v_k^{(m)}), m=1,2,...,M+1$ according to Eqn.~2.
            \STATE Calculate the predictive likelihood $P(\tau_k^Y|\tau_k^X;v_k^{(m)}), m=1,2,...,M+1$ according to Eqn.~3.
            \STATE $m^*=\arg\max_{m}P(v_k^{(m)})P(\tau_k^Y|\tau_k^X;v_k^{(m)})$.
            \IF{$m^*\leq M$}
                \STATE Assign the $k^{\text{th}}$ teammate group to the $m^*$ cluster.
                \STATE Update the cluster center $\bar v^{m^*}=\frac{n^{(m^*)}\bar v^{m^*}+v_k}{n^{(m^*)}+1}$.
                \STATE Update the counter of the cluster $m^*$: $n^{(m^*)}=n^{(m^*)}+1$.
            \ELSE
                \STATE Initialize the $(M+1)^{\text{th}}$ cluster with the $k^{\text{th}}$ teammate group.
                \STATE Initialize the cluster center $\bar v^{M+1}=v_k$.
                \STATE Initialize the counter of the cluster $M+1$: $n^{(M+1)}=1$.
                \STATE Update $M=M+1$.
            \ENDIF
        \ENDFOR
        \STATE Update $K=K+L$.
    \end{algorithmic}
\end{algorithm}
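
The cluster center update in Line 10 of Alg.~\ref{alg1} is an incremental mean. As a small illustration with hypothetical one-dimensional embeddings, suppose cluster $m^*$ currently holds $n^{(m^*)}=3$ teammate groups with center $\bar v^{m^*}=2.0$, and the newly assigned group has $v_k=4.0$. Then
\begin{equation*}
    \bar v^{m^*}\leftarrow\frac{3\cdot 2.0+4.0}{3+1}=2.5,
\end{equation*}
which equals the arithmetic mean of all four member embeddings, so the center can be maintained without storing the past embeddings.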

\begin{algorithm}[!ht]
    \caption{Fastap: training process}
    \label{alg2}
    \textbf{Input}: controllable agent policy networks $\{\pi^i\}_{i=1}^n$, global trajectory encoder $g_\theta$, local trajectory encoders $\{f_{\phi_i}\}_{i=1}^n$, teammate group clusters $\mathcal{C}$, number of clusters instantiated so far $M$, episode length $T$, number of sampled episodes $sample\_num$, environment $env$.
    \begin{algorithmic}[1] 
        \STATE Initialize moving average $\bar z^m=\boldsymbol{0}, m=1,...,M$.
        \STATE Initialize moving average $\bar e^{m,i}=\boldsymbol{0}, m=1,...,M; i=1,..,n$.
        \FOR{$l=1,...,sample\_num$}
            \STATE Sample a teammate group from $\mathcal{C}$ belonging to the $m^{\text{th}}$ cluster.
            \STATE $s_0^m = env.start()$.
            \FOR{$t=0,...,T$}
                \STATE $e_t^{m,i}=f_{\phi_i}(\tau_t^{m,i}),\quad i=1,...,n$.
                \STATE $a_t^{m,i} = \pi^i(\tau_t^{m,i}, e_t^{m,i}),\quad i=1,...,n$.
                \STATE$\boldsymbol{a}^m_t=(a_t^{m,i})_{i=1}^n$. \texttt{\small // controllable agents decision-making}
                \STATE $\boldsymbol{\bar a}^m_t=\boldsymbol{\bar \pi}^m(\boldsymbol{\bar \tau}_t^m)$. \texttt{\small // uncontrollable teammates decision-making}
                \STATE $s_{t+1}^m, r_t^m=env.step(\langle\boldsymbol{a}^m_t, \boldsymbol{\bar a}^m_t\rangle)$.
            \ENDFOR
            \STATE Add trajectory to the replay buffer $\mathcal{D}$.
            \FOR{$m=1,..,M$}
                \STATE Sample $bs$ trajectories from $\mathcal{D}$.
                \STATE Calculate estimated Q-values and context vectors $z^m_t=g_\theta(\tau_t^m), e_t^{m,i}=f_{\phi_i}(\tau_t^{m, i}),\quad t=0,...,T$.
                \STATE Update $\bar z^m=\eta\text{sg}(\bar z^m)+(1-\eta)\text{mean}(z_t^m)$.
                \STATE Update $\bar e^{m,i}=\eta\text{sg}(\bar e^{m,i})+(1-\eta)\text{mean}(e_t^{m, i})$.
                \STATE Optimize agent Q networks according to $\mathcal{L}_{\text{TD}}$.
            \ENDFOR
            \STATE Optimize $g_{\theta}$ according to $\mathcal{L}_{\text{ADAP}}$ in Eqn.~6.
            \STATE Optimize $\{f_{\phi_i}\}_{i=1}^n$ according to $\mathcal{L}_{\text{DEC}}$ in Eqn.~11.
        \ENDFOR
    \end{algorithmic}
\end{algorithm}


\begin{algorithm}[!ht]
    \caption{Fastap: testing process}
    \label{alg3}
    \textbf{Input}: controllable agent policy networks $\{\pi^i\}_{i=1}^n$, local trajectory encoders $\{f_{\phi_i}\}_{i=1}^n$,  episode length $T$, number of test episodes $test\_num$, environment $env$, sudden change distribution $\mathcal{U}$, teammates set $\mathcal{\bar N}$.
    \begin{algorithmic}[1] %[1] enables line numbers
        \FOR{$l=1,...,test\_num$}
            \STATE Sample teammate policy $\boldsymbol{\bar \pi}$ from $\mathcal{\bar N}$.
            \STATE $s_0 = env.start()$.
            \FOR{$t=0,...,T$}
                \IF{$t=0$}
                    \STATE Sample waiting time $u_0\sim \mathcal{U}$.
                \ELSE
                    \STATE Update waiting time $u_t=u_{t-1}-1$.
                    \IF{$u_t\leq 0$}
                        \STATE Re-sample $u_t\sim \mathcal{U}$.
                        \STATE Re-sample teammate policy $\boldsymbol{\bar \pi}$ from $\mathcal{\bar N}$.
                    \ENDIF
                \ENDIF
                \STATE $e_t^{i}=f_{\phi_i}(\tau_t^{i}),\quad i=1,...,n$.
                \STATE $a_t^{i} = \pi^i(\tau_t^{i}, e_t^{i}),\quad i=1,...,n$.
                \STATE $\boldsymbol{a}_t=(a_t^{i})_{i=1}^n$. \texttt{\small // controllable agents decision-making}
                \STATE $\boldsymbol{\bar a}_t=\boldsymbol{\bar \pi}(\boldsymbol{\bar \tau}_t)$. \texttt{\small // uncontrollable teammates decision-making}
                \STATE $s_{t+1}, r_t, done=env.step(\langle\boldsymbol{a}_t, \boldsymbol{\bar a}_t\rangle)$.
            \ENDFOR
        \ENDFOR
    \end{algorithmic}
\end{algorithm}


Our implementation of Fastap is based on the EPymarl\footnote{\url{https://github.com/oxwhirl/epymarl}}~\citep{lbf} codebase with StarCraft 2.4.6.2.69223 and uses its default hyperparameter settings (e.g., $\gamma=0.99$). The additional hyperparameters for each environment are listed in Tab.~\ref{table_hyper}.

\begin{table}[!ht]
    \centering
    \resizebox{\textwidth}{!}{
    \begin{tabular}{|l|cccc|}
    \hline
    \diagbox[width=24em]{Hyperparameter}{\begin{tabular}[c]{@{}l@{}}Environment\\\end{tabular}}               & Level-Based Foraging   & Predator-prey    & Cooperative navigation    & 10m\_vs\_14m  \\ \hline
    concentration hyperparameter $\alpha$           & $0.5$   & $2.5$   & $2.5$   & $0.5$   \\
    number of teammate groups generated in one iteration $L$ & $4$     & $1$     & $1$     & $2$     \\
    radius hyperparameter $\kappa$       & $80$   & $80$    & $80$    & $80$    \\
    moving average hyperparameter $\eta$ & $0.01$  & $0.01$  & $0.01$  & $0.01$  \\
    $\alpha_{\text{GCE}}$                         & $1$     & $0.4$   & $0.4$   & $10$    \\
    $\alpha_{\text{LCE}}$                         & $1$     & $0.4$   & $0.4$   & $10$    \\
    $\alpha_{\text{MI}}$                          & $0.001$ & $0.001$ & $0.001$ & $0.001$ \\
    $\alpha_{\text{REC}}$                         & $0.1$   & $0.2$   & $0.2$   & $0.2$   \\
    dimension of local context vector $e$                       & $4$     & $16$    & $4$     & $8$     \\
    dimension of global context vector $z$                       & $6$     & $20$    & $6$     & $16$    \\ \hline
    \end{tabular}}
    \caption{Hyperparameters in the experiments.}
    \label{table_hyper}
\end{table}
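
As a worked reading of the moving-average step in Algorithm~\ref{alg2}, the following unrolls the first two updates of $\bar z^m$ from its initialization $\bar z^m=\boldsymbol{0}$. This is only a sketch under two assumptions: the update is applied exactly as written in Algorithm~\ref{alg2} with $\eta=0.01$ from Tab.~\ref{table_hyper}, and $\text{sg}(\cdot)$ denotes the stop-gradient operator. The symbols $\mu_1,\mu_2$ and the subscripts $(1),(2)$ are illustrative shorthand, not notation used elsewhere in the paper: $\mu_j$ stands for the batch mean $\text{mean}(z_t^m)$ at the $j^{\text{th}}$ update.
\begin{align*}
\bar z^m_{(1)} &= \eta\,\text{sg}(\boldsymbol{0}) + (1-\eta)\,\mu_1 = 0.99\,\mu_1,\\
\bar z^m_{(2)} &= \eta\,\text{sg}(\bar z^m_{(1)}) + (1-\eta)\,\mu_2 = 0.0099\,\mu_1 + 0.99\,\mu_2.
\end{align*}
Under this reading, the tracked value is dominated by the most recent batch mean, and gradients flow only through the current batch term because the previous value is wrapped in $\text{sg}(\cdot)$. The same arithmetic applies to $\bar e^{m,i}$.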
\section{MORE SENSITIVITY STUDIES}
Here we conduct further experiments on the LBF benchmark to investigate how two other hyperparameters, $\alpha_{\text{LCE}}$ and $\alpha_{\text{REC}}$, influence coordination ability. The results are shown in Fig.~\ref{moresensitivity}. As with the hyperparameters studied in the manuscript, we find that $\alpha_{\text{LCE}}=1$ and $\alpha_{\text{REC}}=0.1$ are the best choices.
\begin{figure}
\setlength{\abovecaptionskip}{0cm}
  \centering
  \subfigure[Sensitivity of $\alpha_{\text{LCE}}$]{
  \label{sensitivity_lce}
      \includegraphics[width=0.426\textwidth]
{Fastap/Figures/Sensitivity/Sensitivity_LCE.pdf}
  }
  \subfigure[Sensitivity of $ \alpha_{\text{REC}}$]{
  \label{sensitivity_rec}
      \includegraphics[width=0.426\textwidth]
{Fastap/Figures/Sensitivity/Sensitivity_REC.pdf}
  }
  \caption{More Sensitivity Studies on LBF.}
  \label{moresensitivity}
\end{figure}




\newpage

\bibliography{ref}
\end{document}
