Implementations that Matter in Cooperative Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) naturally borrows techniques that are known to yield large improvements in single-agent reinforcement learning. However, it is still unclear which of these techniques are complementary and can be fruitfully combined. This post examines several experimental techniques that remain controversial in multi-agent settings and offers concrete advice for enhancing the performance of MARL algorithms.

Given the powerful decision-making ability of reinforcement learning (RL), it is a natural trend to apply it, together with complementary methods, to multi-agent cooperative settings. However, these methods may not transfer universally. In this post, we examine several techniques that have proved effective in RL but remain controversial or ambiguous when applied to QMIX [8], the representative baseline algorithm in multi-agent reinforcement learning (MARL). These inconsistencies also lead us to explore the extreme performance of QMIX on the StarCraft Multi-Agent Challenge (SMAC) [10], especially in its hardest scenarios. Furthermore, we briefly explore the role of the monotonicity constraint in the mixing network and offer proposals to enhance the performance of MARL algorithms under the CTDE paradigm. We also aim to spur discussion about what matters in cooperative multi-agent scenarios, including reproducibility and comparability in MARL. We open-source the code at https://github.com/xxxx/xxxx (Anonymous) so that researchers can evaluate the effects of these proposed techniques and make fair comparisons between algorithms.

From RL to MARL

Ever since AlphaGo beat humans at Go, RL has been a consistent hot spot in both academia and industry. An RL agent obtains rewards by interacting with the environment and takes actions to maximize the cumulative reward. Almost all RL problems can be described as Markov decision processes, as illustrated in Figure 1.

Figure 1: The agent-environment interaction in a Markov decision process. (Image source: Sec. 3.1 of Sutton & Barto [14]).


Just as its name implies, MARL involves multiple agents trained by RL algorithms in the same environment. Many complex multi-agent systems, such as robot swarm control, autonomous vehicle coordination, and sensor networks, can be modeled as MARL tasks. Through their interactions, these agents must work together to achieve a common goal.

Figure 2: Some multi-agent cooperative scenarios (left to right): (a) chasing in the Multi-Agent Particle Environment (Predator-Prey); (b) the MAgent environment; (c) Hide & Seek; (d) the StarCraft Multi-Agent Challenge.

In practice, agents usually have a limited sight range for observing their surroundings. In the example shown in Figure 3, the cyan border indicates the sight and shooting range of the agent, meaning the agent can only obtain information about the terrain or other agents within that range. These kinds of multi-agent tasks can be modeled as decentralized partially observable Markov decision processes (Dec-POMDPs) [6], and the ultimate goal is to find a joint policy $\boldsymbol{\pi} = \langle \pi_{1},…,\pi_{n}\rangle$ that maximizes the global reward.

Figure 3: The partial observation of agents
(Image source: SMAC [10]).

The main challenges standing between MARL and practical applications include inherent communication constraints, partial observability, and the non-stationarity caused by the changing policies of other agents. These challenges make it difficult for agents to cooperate well and lead to unstable learning. The setting known as Centralized Training with Decentralized Execution (CTDE) [15] has been proposed to meet these challenges. It trains the policies in a centralized way, with access to the global state $s$ and the local action-observation histories of all agents; during execution, however, each agent makes its own decision based only on its local action-observation history $\tau^{i}$. The non-stationarity in training is alleviated by learning a shared centralized value function for all agents. Among the algorithms that integrate each agent's $Q_{i}$, QMIX is the representative and effective method for training the agents.

QMIX and Monotonicity Constraint

To handle the relationship between individual agents and the cooperative group, QMIX [8] learns a joint action-value function $Q_{tot}$ and factorizes it into the individual value function of each agent. In other words, as illustrated in Figure 4, QMIX integrates all the individual $Q_{i}$ with a mixing network to obtain a centralized value function $Q_{tot}$, which can then be updated directly from the global reward.

Figure 4: Framework of QMIX. (Image source: QMIX [8])


Formally, QMIX can be written as Eq. (\ref{eq1}):

\[Q_{tot}(s, \boldsymbol{u} ; \boldsymbol{\theta}, \phi) = g_{\phi}\left(s, Q_{1}\left(\tau^{1}, u^{1} ; \theta^{1}\right), \ldots, Q_{N}\left(\tau^{N}, u^{N} ; \theta^{N}\right)\right) \\ \frac{\partial Q_{tot}(s, \boldsymbol{u} ; \boldsymbol{\theta}, \phi)}{\partial Q_{i}\left(\tau^{i}, u^{i}; \theta^{i}\right)} \geq 0, \quad \forall i \in \mathcal{N} \tag{1} \label{eq1}\]

where $\theta^i$ denotes the parameters of agent network $i$, and $\phi$ denotes the trainable parameters of the mixing network, which is responsible for factorizing $Q_{tot}$ into the individual $Q_{i}$. The monotonicity constraint is implemented in the mixing network, which takes the global state $s$ as input and outputs non-negative weights through hypernetworks. This design ensures consistency between the joint action and the individual actions of each agent, thereby guaranteeing the Individual-Global-Max (IGM) principle. Thanks to the monotonicity constraint in Eq. (\ref{eq1}), maximizing the joint $Q_{tot}$ is equivalent to maximizing each individual $Q_i$, so the optimal individual actions remain consistent with the optimal joint action. QMIX then learns the centralized value function $Q_{tot}$ by sampling batches of transitions from the replay buffer and minimizing the mean squared temporal-difference (TD) error loss:

\[\mathcal{L}(\theta, \phi)= \frac{1}{2} \sum_{i=1}^{b}\left[\left(y_{i}-Q_{tot}(s, \boldsymbol{u} ; \theta, \phi)\right)^{2}\right] \tag{2} \label{eq2}\]

where the TD target value is $y=r+\gamma \underset{u^{\prime}}{\operatorname{max}} Q_{tot}(s^{\prime},u^{\prime};\theta^{-},\phi^{-})$, and $\theta^{-}, \phi^{-}$ are the target network parameters, copied periodically from the current networks and kept constant for a number of iterations.
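To make the two equations concrete, here is a minimal PyTorch sketch of a QMIX-style monotonic mixing network and the corresponding TD loss. It follows the spirit of the PyMARL implementation, but the module names, tensor shapes, and hyperparameters (e.g., `embed_dim`) are illustrative assumptions rather than the exact code used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixing network: hypernetworks conditioned on the global state
    produce the mixing weights, and taking their absolute value enforces
    dQ_tot/dQ_i >= 0 (the monotonicity constraint in Eq. (1))."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        agent_qs = agent_qs.view(bs, 1, self.n_agents)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)      # Q_tot: (batch, 1)

def qmix_td_loss(agent_qs, target_agent_qs, actions, rewards, states,
                 next_states, terminated, mixer, target_mixer, gamma=0.99):
    """1-step TD loss of Eq. (2), averaged over the batch.
    agent_qs / target_agent_qs: (batch, n_agents, n_actions);
    actions: (batch, n_agents, 1); rewards / terminated: (batch, 1)."""
    chosen_qs = torch.gather(agent_qs, dim=2, index=actions).squeeze(2)  # (batch, n_agents)
    q_tot = mixer(chosen_qs, states)
    with torch.no_grad():
        # Monotonicity (IGM) lets us maximize Q_tot by maximizing each Q_i.
        max_next_qs = target_agent_qs.max(dim=2).values
        y = rewards + gamma * (1 - terminated) * target_mixer(max_next_qs, next_states)
    return 0.5 * ((y - q_tot) ** 2).mean()
```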

It is not surprising that many variant algorithms of QMIX have subsequently been developed, aiming either to relax the monotonicity constraint or to learn a more stable and generalizable centralized value function. As a pioneer, the Value-Decomposition Network (VDN) [13] uses only a linear decomposition, $Q_{tot} = \sum_{i}^{N} Q_i$. Qatten [17] introduces an attention mechanism to determine the contribution of each agent based on its observations. QTRAN [11] learns the discrepancy between $\sum_{i}^{N} Q_i$ and the true $Q_{tot}$, factorizing the centralized critic and training all agents end-to-end. QPLEX [15] transfers the monotonicity constraint from Q values to advantage values [27] and introduces a duplex dueling network to integrate the state information. WQMIX [9] scales down the estimated centralized value of non-optimal joint actions and further relaxes the monotonicity constraint with a true value network and some theoretical conditions. SMIX($\lambda$) [20] enhances QMIX by incorporating a SARSA($\lambda$)-style estimator into the centralized critic, and MAVEN [3] introduces committed exploration so that all agents follow a joint exploratory policy over an entire episode. VMIX [12] combines Advantage Actor-Critic (A2C) [5] with QMIX to extend the monotonicity constraint to the critic networks.

Since all of these subsequently developed algorithms report performance exceeding QMIX on SMAC, a question has been nagging us: is the performance of QMIX lower than expected because of improper training parameters or techniques? We want to know which techniques actually affect the performance of QMIX, and of cooperative MARL algorithms more broadly.

Extension to QMIX

Experimental Design

To study which techniques affect the training effectiveness and sample efficiency of QMIX, we perform a set of experiments designed to provide insight into methods that have proved effective in single-agent RL but remain ambiguous in MARL. In particular, we investigate the effects of: the Adam optimizer with parallel rollout processes; the replay buffer size; the number of parallel rollout processes; the number of $\epsilon$-exploration steps; the implementation of $Q(\lambda)$ in the centralized value function; and the hidden size of the agents' recurrent networks. We also study the role of the monotonicity constraint in QMIX. For all experiments, we use the PyMARL [10] framework to implement QMIX and its variants. To ensure fairness, we run five independent trials for each evaluation, each with a different random seed. Unless otherwise mentioned, we use the default PyMARL settings wherever possible while incorporating the technique of interest. All results are plotted as the median across seeds with a shaded interval.

StarCraft Multi-Agent Challenge (SMAC) As a commonly used testing environment, SMAC [10] offers a great opportunity to study cooperative control problems in the multi-agent domain. We focus on the micromanagement challenge in SMAC, in which each unit is controlled by an independent learning agent that conditions on a limited local observation, and the group of units is trained to defeat an enemy army controlled by the built-in AI. According to the quantity and type of enemy units, the test scenarios are divided into Easy, Hard, and Super-Hard levels. Since QMIX effectively solves the Easy tasks, we focus on the Hard and Super-Hard scenarios that QMIX fails to win, especially Corridor, 3s5z_vs_3s6z, and 6h_vs_8z.

Predator-Prey (PP) is representative of another classical problem, relative overgeneralization [16]. The cooperating predators are trained to chase a faster-moving prey and must capture this escaping robot in as few steps as possible. We use two difficulty-enhanced Predator-Prey variants to test the algorithms: (1) Predator-Prey-1 (PP-1), which requires two predators to catch the prey at the same time to receive a reward; and (2) Predator-Prey-2 (PP-2), in which the prey's policy is replaced with a hard-coded heuristic that moves the prey to the sampled position farthest from the predators. Both environments demand a higher degree of cooperation between agents.

Optimizer

The choice of optimizer is an important part of training neural networks, since it can seriously affect how well reinforcement learning agents learn. Unless otherwise stated, QMIX and its variant algorithms use RMSProp [21] to optimize the agents' networks, as it has proved stable on SMAC. Adam [1], on the other hand, is famous for fast convergence thanks to its momentum terms and tends to be the default choice for many researchers. We reckon that Adam's momentum should be advantageous for fitting the freshly sampled data that agents generate by interacting with the environment in MARL. Meanwhile, QMIX has been criticized for sub-optimal performance and sample inefficiency when equipped with the A2C framework, a framework designed to improve the training efficiency of RL algorithms. VMIX [12] argues that this limitation stems from the value-based Q function itself and therefore extends QMIX to an actor-critic style algorithm to take advantage of A2C. This controversy motivates us to evaluate the performance of QMIX with Adam, as well as with the parallel sampling paradigm.

Figure 5: The Q networks optimized by Adam and RMSProp.


Results As shown in Figure 5, we run Adam-supported QMIX with 8 rollout processes. Contrary to what is described in VMIX, the performance and efficiency of QMIX are greatly improved by Adam. We speculate that Adam's momentum lets the networks quickly fit the newly sampled data from the parallel rollout processes and thereby boosts performance, whereas RMSProp cannot. Hence the limitation reported by VMIX is most likely due to an improper choice of optimizer. Adam should remain an important consideration in MARL.
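As a concrete illustration, switching the optimizer in a PyMARL-style learner is a small change; the sketch below collects the agent and mixer parameters and picks between the two optimizers. The RMSProp hyperparameters mirror common PyMARL defaults and are assumptions, not a prescription.

```python
import torch

def build_optimiser(networks, use_adam=True, lr_adam=1e-3, lr_rmsprop=5e-4):
    """Collect parameters from the agent network(s) and mixing network and
    return either Adam (evaluated in this section) or a PyMARL-style RMSProp."""
    params = [p for net in networks for p in net.parameters()]
    if use_adam:
        return torch.optim.Adam(params, lr=lr_adam)
    return torch.optim.RMSprop(params, lr=lr_rmsprop, alpha=0.99, eps=1e-5)

# Usage (assuming `agent_net` and `mixer` are nn.Modules):
# optimiser = build_optimiser([agent_net, mixer], use_adam=True)
```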

Rollout Process Number

Naturally, we next focus on the benefits of parallel data sampling in QMIX. A2C [5] provides an excellent example of reducing wall-clock training time and improving training efficiency in single-agent RL. When we implement algorithms under the A2C paradigm, there is usually a fixed total number of samples and an unspecified number of rollout processes. The total number of samples can be written as $S = E \cdot P \cdot I$, where $S$ is the total amount of sampled data, $E$ is the number of samples per episode, and $P$ and $I$ are the number of parallel rollout processes and the number of policy iterations, respectively. This section analyzes and discusses the impact of the number of parallel rollout processes on the final performance of QMIX.
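A quick numerical reading of $S = E \cdot P \cdot I$ with hypothetical numbers: under a fixed sample budget, halving the number of rollout processes doubles the number of policy iterations.

```python
def policy_iterations(total_samples, episode_len, n_processes):
    """I = S / (E * P): policy iterations under a fixed sample budget."""
    return total_samples // (episode_len * n_processes)

# Illustrative numbers only: a 2M-step budget with 100-step episodes.
print(policy_iterations(2_000_000, 100, 8))  # 2500 policy iterations
print(policy_iterations(2_000_000, 100, 4))  # 5000 policy iterations
```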

Figure 6: Given the total number of samples, fewer processes achieve better performance.


Results We again use Adam-supported QMIX to evaluate the effect of the number of rollout processes. Since PyMARL's parallel runner samples agent-environment interactions concurrently, we can in principle obtain more on-policy data that is close to the policy being updated. Figure 6 shows that, for a fixed sample budget $S$, the performance of QMIX does not improve consistently as the number of rollout processes increases. The intuitive explanation is that, with fewer rollout processes, the policy is iterated more times [14]. Moreover, data that is refreshed too quickly in parallel may artificially destabilize policy updates, i.e., it becomes difficult for agents to extract useful information from rapidly replaced data in the replay buffer. The more policy iterations are performed, the more the agents can learn, which increases performance; however, this also lengthens training time and can cost stability. We suggest starting with fewer rollout processes and then balancing training time against performance.

Replay Buffer Size

The replay buffer plays an important role in improving sample efficiency in off-policy single-agent RL, and its capacity can greatly affect the performance and stability of algorithms. Researchers usually set a very large replay buffer in Deep Q-Networks (DQN) [4] to stabilize training. The effect of the replay buffer in single-agent RL has already been studied in [22], which argues that the distribution of sampled training data should be as close as possible to the policy being updated. Two factors change when we alter the replay buffer capacity: (1) the replay capacity (the total number of transitions or episodes stored in the buffer); and (2) the replay ratio (the number of gradient updates per environment transition or episode) with respect to old policies. When we increase the replay capacity with the replay ratio fixed, the amount of experience generated by old policies grows, and the distribution of these outdated experiences drifts further from the policy being updated, adding difficulty to training. The results in [22] suggest there is an optimal range of replay buffer size and replay ratio in single-agent RL, and we would like to know whether this carries over to MARL.
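To make the capacity/replay-ratio trade-off concrete, here is a tiny sketch (with hypothetical numbers) of how buffer capacity, measured in episodes, determines how stale the oldest retrievable experience is when the buffer is filled by parallel rollouts.

```python
def oldest_policy_age(buffer_capacity_episodes, episodes_per_iteration):
    """Approximate age (in policy iterations) of the stalest episodes still in a
    FIFO buffer that receives episodes_per_iteration new episodes per update."""
    return buffer_capacity_episodes / episodes_per_iteration

# With 8 parallel rollout processes contributing 8 episodes per iteration:
print(oldest_policy_age(5000, 8))   # ~625 iterations old (default-sized buffer)
print(oldest_policy_age(20000, 8))  # ~2500 iterations old: far staler experience
```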

Figure 7: Setting the replay buffer size to 5000 episodes allows for QMIX’s learning to be stable.


Results The results are not consistent with those in single-agent RL. Figure 7 shows that a large replay buffer causes instability in QMIX during training: when we increase the buffer size from the PyMARL default, performance declines almost monotonically. We speculate that, because the joint policy changes quickly during training, the experiences held in a larger buffer become stale relative to the current policy, and the enormous joint action space makes this mismatched distribution harder to fit. On the other hand, we observe a similar decline when we shrink the buffer. We reckon that an undersized buffer effectively accelerates the turnover of sampled data, which again makes it hard to fit the data and learn a good policy. We believe the default replay buffer size of QMIX is satisfactory in this framework, and researchers should be cautious about increasing it in other multi-agent applications.

Eligibility Traces

The trade-off between the bias and variance of bootstrapping is a classic research topic in RL. Since we rely on a centralized value function (CVF) to alleviate non-stationarity in multi-agent settings, the accuracy of the CVF estimate is critical in MARL, as it guides the policy updates of all agents. Eligibility traces such as TD($\lambda$) [14], Peng's Q($\lambda$) [2], and TB($\lambda$) [7] strike a balance between return-based algorithms (where the return is the sum of discounted rewards $\sum_{t} \gamma^{t} r_{t}$) and bootstrapping algorithms (where the return is $r_t + \gamma V(s_{t+1})$), thereby speeding up the convergence of the agents' policies. As a pioneer, SMIX($\lambda$) [20] equipped QMIX with SARSA($\lambda$) to estimate the CVF more accurately and achieved decent performance. As another example of eligibility traces in Q-learning, we study the estimation of the CVF using Peng's Q($\lambda$) for QMIX.
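A minimal sketch of how Peng's Q($\lambda$) targets can be computed backward over a single episode, using the recursion $G_t = r_t + \gamma\,[(1-\lambda)\max_{u'} Q_{tot}(s_{t+1}, u') + \lambda\, G_{t+1}]$. Tensor shapes and names are illustrative assumptions, not the exact implementation used here.

```python
import torch

def peng_q_lambda_targets(rewards, max_next_qtot, terminated, gamma=0.99, lam=0.5):
    """Peng's Q(lambda) targets over one episode of length T (all inputs shape (T,)).
    max_next_qtot[t] = max_{u'} Q_tot(s_{t+1}, u') from the target network."""
    targets = torch.zeros_like(rewards)
    g = torch.tensor(0.0)
    for t in reversed(range(rewards.shape[0])):
        bootstrap = (1 - lam) * max_next_qtot[t] + lam * g
        g = rewards[t] + gamma * (1 - terminated[t]) * bootstrap
        targets[t] = g   # lam=0 recovers the 1-step target, lam=1 the Monte-Carlo return
    return targets
```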

Figure 8: Q(λ) significantly improves performance of QMIX, but large values of λ lead to instability in the algorithm.


Results As in single-agent RL, Q-networks that are not yet sufficiently trained produce heavily biased bootstrapped returns. Figure 8 shows that, with the help of Q($\lambda$), the performance of QMIX improves across essentially all scenarios, meaning that a more accurate CVF estimate indeed gives each agent a better direction for policy updates. However, $\lambda$ cannot be set as aggressively as in single-agent RL: large values lead to failed convergence because of the high variance. We recommend values around $\lambda=0.5$ when using $Q(\lambda)$ in MARL.

Hidden Size

Searching for an optimal scale and architecture of neural networks is a notoriously hard problem in machine learning, and researchers typically use empirically small networks to train agents in deep reinforcement learning. Since the role of the neural network is to extract features from the input states and actions, its size can have a large impact on the performance of MARL algorithms. The study in [23] shows that networks with complex structures such as ResNet [25] and DenseNet [26] can extract more useful information for training, while Ba [24] argues that the width of a neural network probably matters more than its depth. A follow-up study of QMIX [19] made a preliminary investigation of network depth and found only limited improvement, but there is little research on network width in MARL. Instead of searching for an optimal architecture, we conduct a pilot study on the effect of the hidden size (network width) in QMIX.

Figure 9: Impact of the hidden size of the network in QMIX.


Results The study in [24] illustrates the ability of infinitely wide networks to fit arbitrarily complex functions, which in theory suggests a performance gain from increasing network width. As shown in Figure 9, both final performance and training efficiency improve to varying degrees when we increase the hidden size from 64 to 256 in QMIX, where QMIX-ALL-Hidden enlarges both the RNN and the mixing network, while QMIX-RNN-Hidden enlarges only the RNN. The results also reveal a striking effect of widening the RNN, which yields roughly a 20% win-rate increase on the Super-Hard scenario 3s5z_vs_3s6z, whereas enlarging the mixing network brings only limited improvement. We speculate that more units are needed to represent the complex temporal context handled by the RNN, which the mixing network does not process. We advise researchers to appropriately increase the RNN width to achieve better performance.
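For reference, a minimal sketch of a PyMARL-style recurrent agent network in which `rnn_hidden_dim` is the width being varied (64 vs. 256); the class and argument names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    """GRU-based agent network; rnn_hidden_dim is the width studied in Figure 9
    (QMIX-RNN-Hidden corresponds to raising this value, e.g. 64 -> 256)."""

    def __init__(self, input_dim, n_actions, rnn_hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, rnn_hidden_dim)
        self.rnn = nn.GRUCell(rnn_hidden_dim, rnn_hidden_dim)
        self.fc2 = nn.Linear(rnn_hidden_dim, n_actions)

    def forward(self, obs, hidden_state):
        x = F.relu(self.fc1(obs))
        h = self.rnn(x, hidden_state)   # carry temporal context across timesteps
        return self.fc2(h), h           # per-action Q_i values, next hidden state
```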

Exploration Steps

Exploration versus exploitation is another classic trade-off in reinforcement learning: agents need some directed mechanism to explore states that may have higher value or have not yet been visited. The most widely used exploration method in RL is $\epsilon$-greedy action selection, in which the agent selects a random action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. The value of $\epsilon$ is annealed during training and then held at a small constant. This mechanism is usually applied independently by each agent, which MAVEN [3] criticizes for lacking a joint exploratory policy over an entire episode. Nevertheless, slowing the decay of $\epsilon$ still yields more exploration, so we evaluate the effect of the $\epsilon$-greedy annealing period in some Super-Hard scenarios of SMAC.
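The annealing period is simply the slope of a linear schedule; a small sketch follows, with start/finish values matching the settings described later in Appendix B.

```python
def epsilon_schedule(t, start=1.0, finish=0.05, anneal_steps=500_000):
    """Linearly anneal epsilon from `start` to `finish` over `anneal_steps`
    environment steps, then hold it constant."""
    frac = min(t / anneal_steps, 1.0)
    return start + frac * (finish - start)

# A slower 500K-step schedule keeps exploration high for longer:
print(epsilon_schedule(0))          # 1.0
print(epsilon_schedule(250_000))    # 0.525
print(epsilon_schedule(1_000_000))  # 0.05
```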

Figure 10: Experiments on the ε annealing period.


Results Increasing the annealing period of $\epsilon$-greedy from 100K steps to 500K steps yields a clear performance gain in these hard-exploration scenarios, where QMIX fails with the default setting. However, as shown in Figure 10, an overly long period such as 1000K steps introduces extra exploration noise and can even make training collapse. These results confirm that $\epsilon$-greedy remains the simplest adequate exploration mechanism in MARL, but it should be carefully tuned for each task.

Integrating the Techniques

The techniques discussed above clearly affect QMIX in the hard cooperative scenarios of SMAC, which motivates us to push QMIX to its limit. We combine these techniques and fine-tune all the hyperparameters of QMIX for each SMAC scenario. As shown in Table 1, the finetuned QMIX conquers almost every scenario in SMAC and exceeds the original QMIX by a large margin in some Hard and Super-Hard scenarios.

Table 1: Best median test win rate of Finetuned-QMIX and QMIX (batch size=128) in all scenarios.
Scenarios Difficulty QMIX Finetuned-QMIX
10m_vs_11m Easy 98% 100%
8m_vs_9m Hard 84% 100%
5m_vs_6m Hard 84% 90%
3s_vs_5z Hard 96% 100%
bane_vs_bane Hard 100% 100%
2c_vs_64zg Hard 100% 100%
corridor Super hard 0% 100%
MMM2 Super hard 98% 100%
3s5z_vs_3s6z Super hard 3% 93% (Hidden Size = 256)
27m_vs_30m Super hard 56% 100%
6h_vs_8z Super hard 0% 93% (λ = 0.3)

Besides, we are curious how much these techniques improve the subsequently proposed variants of QMIX. We therefore normalize the techniques across all of these algorithms, i.e., we perform the same grid search on a typical Hard scenario (5m_vs_6m) and a Super-Hard scenario (3s5z_vs_3s6z) to find a general set of hyperparameters for each method. As shown in Table 2, QMIX still conquers the Super-Hard tasks and surpasses the other variants in most scenarios. In general, the variants of QMIX [9; 11; 13] that modify or relax the monotonicity constraint do not obtain better performance than QMIX itself. This demonstrates that a well-tuned QMIX is more than just a baseline algorithm for cooperative scenarios.

Table 2: Median test win rate (or episode return) of MARL algorithms with normalized techniques. S-Hard denotes the Super-Hard level. We compare performance on the most difficult scenarios of SMAC and on Predator-Prey-1.
Scenarios Difficulty Algorithm
QMIX VDN Qatten QPLEX WQMIX VMIX AC-MIX
2c_vs_64zg Hard 100% 100% 100% 100% 100% 98% 100%
8m_vs_9m Hard 100% 100% 100% 95% 95% 75% 95%
3s_vs_5z Hard 100% 100% 100% 100% 100% 96% 96%
5m_vs_6m Hard 90% 90% 90% 90% 90% 9% 67%
3s5z_vs_3s6z S-Hard 75% 43% 62% 68% 56% 56% 75%
Corridor S-Hard 100% 98% 100% 96% 96% 0% 100%
6h_vs_8z S-Hard 84% 87% 82% 78% 75% 80% 19%
MMM2 S-Hard 100% 96% 100% 100% 96% 70% 100%
27m_vs_30m S-Hard 100% 100% 100% 100% 100% 93% 93%
Predator-Prey-1 - 40 39 - 39 39 39 38
Avg. Score - 94.9% 91.2% 92.7% 92.5% 90.5% 67.4% 84.0%


Role of Monotonicity Constraint

Amazing Performance in Policy-Based Methods

Figure 11: Architecture of AC-MIX. |·| denotes the absolute value operation, implementing the monotonicity constraint of QMIX. W denotes the non-negative mixing weights. Agent i denotes the policy network, which is trained end-to-end by maximizing $Q_{tot}$.


The novelty of QMIX lies in the IGM consistency between $\operatorname{argmax} Q_{tot}$ and the individual $\operatorname{argmax} Q_{i}$ of each agent, which is implemented through the mixing network. To further study the role of the monotonicity constraint in MARL, we propose an actor-critic style algorithm called Actor-Critic-Mixer (AC-MIX), which has an architecture similar to QMIX. As illustrated in Figure 11, we use the monotonic mixing network as a centralized critic that integrates each agent's $Q_{i}$ and optimizes the decentralized policy networks $\pi^i_{\theta_i}$ end-to-end. We also add the adaptive entropy [18] of each agent to the objective in Eq. (\ref{eq3}) to obtain more exploration; the details of the algorithm are described in Appendix A.

\[\max _{\theta} \mathbb{E}_{t, s_{t}, \tau_{t}^{1}, \ldots, \tau_{t}^{n}}\left[Q_{\theta_{c}}^{\pi}\left(s_{t}, \pi_{\theta_{1}}^{1}\left(\cdot \mid \tau_{t}^{1}\right), \ldots, \pi_{\theta_{n}}^{n}\left(\cdot \mid \tau_{t}^{n}\right)\right) + \mathbb{E}_{i}\left[\mathcal{H}\left(\pi_{\theta_{i}}^{i}\left(\cdot \mid \tau_{t}^{i}\right)\right)\right]\right] \tag{3} \label{eq3}\]
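A hedged sketch of one way to realize the policy update in Eq. (3) for discrete actions: each agent's expected utility under its current policy is pushed through the monotonic mixer, and an entropy bonus is added. The exact estimator used by AC-MIX may differ (see Appendix A), a fixed entropy coefficient stands in for the adaptive entropy, and all names are illustrative.

```python
import torch

def acmix_policy_loss(logits, agent_qs, states, mixer, entropy_coef=0.01):
    """Policy loss sketch for an AC-MIX-style end-to-end update.
    logits:   (batch, n_agents, n_actions) from the decentralized policy networks
    agent_qs: (batch, n_agents, n_actions) per-agent utilities from the critic
    mixer:    monotonic mixing network mapping per-agent values + state to Q_tot."""
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)

    # Expected per-agent utility under each agent's current policy.
    expected_qs = (probs * agent_qs).sum(dim=-1)    # (batch, n_agents)
    q_tot = mixer(expected_qs, states)              # (batch, 1)

    # Entropy bonus encourages exploration (Eq. (3)); only the policy
    # parameters should be stepped with this loss, not the critic/mixer.
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return -(q_tot.mean() + entropy_coef * entropy)
```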

Figure 12: Comparing AC-MIX with and without the monotonicity constraint (absolute value operation removed) on SMAC and Predator-Prey-2.


In principle, the monotonicity constraint on the critic of AC-MIX is no longer required, since the critic is not used for greedy action selection. We can therefore evaluate the effect of the monotonicity constraint by removing the absolute value operation in the mixing network. The results in Figure 12 demonstrate that the monotonicity constraint significantly improves the performance of AC-MIX. To explore the generality of the monotonicity constraint under the parallel sampling framework of MARL, we extend the experiment to VMIX [12], which adds the monotonicity constraint to the value network of A2C and learns each agent's policy with an advantage-based policy gradient [14], as illustrated in Figure 13. The result in Figure 14 shows that the monotonicity constraint likewise improves the sample efficiency of the value networks.

Figure 13: Architecture of VMIX. |·| denotes the absolute value operation.


Figure 14: Comparing VMIX with and without the monotonicity constraint (absolute value operation removed) on SMAC.
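The ablation above boils down to toggling the absolute value applied to the hypernetwork outputs. A sketch follows, reusing the `MonotonicMixer` from the earlier section; both the subclass and the flag name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class AblatableMixer(MonotonicMixer):
    """Mixer whose monotonicity constraint can be switched off, mirroring the
    'with vs. without absolute value' ablation in Figures 12 and 14."""

    def __init__(self, n_agents, state_dim, embed_dim=32, monotonic=True):
        super().__init__(n_agents, state_dim, embed_dim)
        self.monotonic = monotonic

    def forward(self, agent_qs, state):
        bs = agent_qs.size(0)
        agent_qs = agent_qs.view(bs, 1, self.n_agents)
        w1 = self.hyper_w1(state).view(bs, self.n_agents, self.embed_dim)
        w2 = self.hyper_w2(state).view(bs, self.embed_dim, 1)
        if self.monotonic:
            # Keeping the absolute value enforces dQ_tot/dQ_i >= 0.
            w1, w2 = torch.abs(w1), torch.abs(w2)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)
```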


What is Under the Hood?

The previous experiments show that the monotonicity constraint in the mixing network indeed improves performance and sample efficiency. On the flip side, QMIX is still criticized for the insufficient expressive capacity of its centralized critic. The most common test is the single-state matrix game, which contains only two agents with three actions each and requires them to discover the optimal joint action $(A, A)$ in Table 3. When we visualize the payoffs of these matrices in Figure 15, we find a deep "ditch" between the optimal and sub-optimal joint actions, a representative instance of the relative overgeneralization pathology in multi-agent tasks (see the small numerical check after Figure 15).

Table 3: Single-state matrix game payoffs (rows: agent 1's action a1; columns: agent 2's action a2).
Table 3(a): Original version
a1 \ a2 A B C
A 8 -12 -12
B -12 0 0
C -12 0 0
Table 3(b): Hard version
a1 \ a2 A B C
A 12 -15 -15
B -15 9 9
C -15 9 9
Table 3(c): Easy version
a1 \ a2 A B C
A 6 -5 -5
B -5 0 0
C -5 0 0
Figure 15: Payoff visualizations of the (a) original, (b) hard, and (c) easy versions of the single-state matrix game, corresponding to Table 3.
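As a small numerical check of the relative overgeneralization pathology in Table 3(a): when the other agent explores uniformly, the average payoff of the optimal action A is lower than that of B or C, which is exactly the signal a monotonic factorization latches onto (illustrative analysis, not the training code).

```python
import numpy as np

# Payoff matrix of the original single-state matrix game (Table 3(a)):
# rows = actions of agent 1, columns = actions of agent 2.
payoff = np.array([[  8, -12, -12],
                   [-12,   0,   0],
                   [-12,   0,   0]], dtype=float)

# Expected payoff of each of agent 1's actions when agent 2 acts uniformly at random.
print(payoff.mean(axis=1))   # [-5.33 -4.   -4.  ]: the optimal action A looks worst,
                             # so a monotonic factorization is pulled away from (A, A).
```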


Still, QMIX cannot learn accurate payoffs for the matrix game, as the learning results in Table 4 show, even with full exploration (i.e., $\epsilon$=1 in $\epsilon$-greedy) throughout training. A proper analysis of the consistency between the deterministic greedy decentralized policies and the deterministic greedy centralized policy derived from the optimal joint action-value function is given in [19]. As illustrated in Figure 16, the argmax consistency enforced on $Q_{tot}$ forces the learned values to be monotonic, which makes $Q_{tot}$ unable to estimate values accurately in relative overgeneralization problems.

Table 4: Learning results of QMIX in the single-state matrix game.
Table 4(a): Original version
a1 \ a2 A B C
A -9.26 -9.44 -9.62
B -9.1 0 -0.05
C -9.27 -0.08 -0.6
Table 4(b): Hard version
a1 \ a2 A B C
A -9.83 -10.01 -10.18
B -9.65 -0.04 -0.39
C -9.82 0.16 0.01
Table 4(c): Easy version
a1 \ a2 A B C
A 6.43 -2.80 -2.66
B -2.50 -3.05 -2.86
C -2.66 -2.82 -2.67

Figure 16: Monotonicity in the mixing network of QMIX.
(Image source: QMIX [8])


Two questions naturally arise: (1) Why does QMIX perform better than variants such as WQMIX or QTRAN that aim to relax the monotonicity constraint of the mixing network? (2) How can we overcome the disadvantage of QMIX's inaccurate $Q_{tot}$?

To answer these questions, we first reexamine the IGM principle. The monotonicity in QMIX is defined as a constraint on the relationship between $Q_{tot}$ and each $Q_{i}$:

\[Q_{tot} = \sum_{i=1}^{N}w_{i}(s_{t}) \cdot Q_{i} + b(s_{t}), \\ w_{i} = \frac{\partial Q_{tot}}{\partial Q_{i}} \geq 0, \quad \forall i \in \mathcal{N}. \tag{4} \label{eq4}\]

From the sufficient condition above, the weights $w_{i}$ generated by the hypernetwork are forced to be greater than or equal to zero. Put another way, this shrinks the parameter space in which the weights $w_{i}$ are searched. As illustrated in the schematic diagram of Figure 17 for just two agents, if the red region is the original search space, the restricted search space of $w_{i}$ is the blue region in the first quadrant, so an optimal solution lying in the original domain may not be expressible in the restricted region. At the same time, the search space over the weights shrinks exponentially, by a factor of $(\frac{1}{2})^{N}$ ($N$ denotes the number of weights $w_{i}$, which equals the number of agents). Since learning in MARL essentially searches for the optimal joint policy parameterized by the weights and biases of the agent and mixing networks, QMIX can find a satisfying policy more quickly in this reduced parameter space.
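One way to read the $(\frac{1}{2})^{N}$ claim is as a counting argument over the sign patterns of the weight vector:

\[(w_{1}, \ldots, w_{N}) \in \mathbb{R}^{N} \text{ spans } 2^{N} \text{ sign orthants; the constraint } w_{i} \geq 0 \ \forall i \text{ keeps only } \mathbb{R}_{\geq 0}^{N}, \text{ i.e., a fraction } \left(\tfrac{1}{2}\right)^{N} \text{ of them.}\]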

Figure 17: Diagram of the parameter search space for two agents in QMIX.


As a side effect, the global optimum may not lie in the parameter space that QMIX searches at all, due to the monotonicity of the mixing network. One effective remedy is to estimate $Q_{tot}$ as accurately as possible in the hope of finding the global optimum; this probably explains why $Q(\lambda)$ in the previous section yields such a performance improvement on SMAC. Alternatively, we could carefully design the reward function to be approximately monotonic when using QMIX to solve cooperative multi-agent tasks. However, tailoring the task and the algorithm to each other in this way is not a sustainable solution; ultimately we still need to figure out how to use QMIX more effectively or develop more efficient algorithms.

Reproducibility and Fairness

Since experimental techniques have such a great impact on the performance of QMIX, we need to be very careful when treating QMIX as a baseline for newly proposed algorithms (especially composite algorithms). Because some cooperative tasks offer only a few simple metrics of an algorithm's capability (such as the win rates of SMAC scenarios), it is often unclear whether a reported improvement comes from an elaborate design specific to the cooperative task or merely from fine-tuned techniques and hyperparameters. To ensure continued progress in MARL, we hope the community will start a discussion about fair comparisons among algorithms and propose a rigorous set of criteria for judging the contribution of new algorithms. We also believe community members should consider the best ways to demonstrate that MARL continues to matter, just as RL does.

Appendix

A Pseudo-code of AC-MIX

In this section, we show the pseudo-code for the training procedure of AC-MIX. (1) Training the critic network with offline samples and a 1-step TD error loss improves the critic's sample efficiency. (2) We find that the policy networks are sensitive to the reuse of old samples; training the policy networks end-to-end, and the critic with TD($\lambda$) on online samples, improves the learning stability of AC-MIX.

B Hyperparameters

In this section, we present our hyperparameter tuning process. We obtain the optimal hyperparameters for each algorithm by grid search, as shown in Table 5. Specifically,

  1. For the experiments in Table 1, we perform a hyperparameter search on each scenario for QMIX to demonstrate its best achievable performance.
  2. For the experiments in Table 2, we perform the grid search on a typical Hard scenario (5m_vs_6m) and a Super-Hard scenario (3s5z_vs_3s6z) to find a general set of hyperparameters for each algorithm. In this way, we can also evaluate the robustness of these MARL algorithms.
Table 5: Hyperparameters Search on SMAC.
Tricks Value-based (VB) Policy-based (PG)
Optimizer Adam,RMSProp Adam,RMSProp
Learning Rates 0.0005, 0.001 0.0005, 0.001
Batch Size (episodes) 32, 64, 128 32, 64
Replay Buffer Size 5000, 10000, 20000 2000, 5000, 10000
Q(λ)/TD(λ) 0, 0.3, 0.6, 0.9 0.3, 0.6, 0.8
Entropy/Adaptive Entropy - 0.005, 0.01, 0.03, 0.06
ε Anneal Steps 50K, 100K, 500K, 1000K -


Table 6: Hyperparameters Settings.
Algorithms QMIX OurQMIX Qatten OurQatten QPLEX OurQPLEX WQMIX OurWQMIX AC-MIX
Optimizer RMSProp Adam RMSProp Adam RMSProp Adam RMSProp Adam Adam
Batch Size (eps) 32 128 32 128 32 128 32 128 32(on)/64(off)
Q(λ)/TD(λ) 0 0.6 0 0.6 0 0.6 0 0.6 0.6
Attention Heads - - 4 4 10 4 - - -
Mixing-Net Size 41K 41K 58K 58K 476K 152K 247K 247K 69K
ε Anneal Steps 50K→500K for 6h_vs_8z, 100K for others -
Processes Num 8 8 1 8 1 8 1 8 8


Table 6 shows our general settings for these algorithms. The network sizes are calculated under 6h_vs_8z, and the prefix "Our" denotes the fine-tuned hyperparameter settings. Next, we describe the settings of these hyperparameters in detail.

Neural Network Size. We first ensure the network sizes are of the same order of magnitude; e.g., we use 4 attention heads, which reduces the mixing-net size of QPLEX from 476K to 152K parameters. The hidden size of all agent networks is 64, the same as in QMIX [8].

Optimizer & Learning Rate. We use Adam to optimize all networks, as it may accelerate convergence, except for VMIX (which works better with RMSProp). All neural networks are trained with a learning rate of 0.001.

Batch Size. We find that a large batch size helps improve the stability of the algorithms. For all value-based algorithms, we set the batch size to 128. For the policy-based algorithms, we set the batch size to 64/32 (offline/online training), since online updates require only the newest data.

Replay Buffer Size. As discussed in previous sections, a small replay buffer size facilitates the convergence of the MARL algorithms. Therefore, for SMAC, the size of all replay buffers is set to 5000 episodes. For Predator-Prey, we set the buffer size to 1000 episodes.

Exploration. As discussed in previous sections, we use $\epsilon$-greedy action selection, decreasing $\epsilon$ from 1 to 0.05 over n time steps (n can be found in Table 6) for the value-based algorithms. For VMIX, we use the policy entropy loss and fine-tune its coefficient for different scenarios.

N-step Returns. We find that the $\lambda$ values of Q($\lambda$) and TD($\lambda$) depend heavily on the algorithm and scenario. We use $\lambda$ = 0.6 for all tasks, as it works stably in most scenarios; for the on-policy algorithm VMIX, we set $\lambda$ = 0.8.

Rollout Processes Number. For SMAC and Predator-Prey-1, 8 rollout processes are used for parallel sampling to obtain samples from the environments at a high rate; 4 rollout processes are used for Predator-Prey-2. All algorithms use the same number of processes to ensure the same number of policy iterations.

Other Settings. We set all discount factors to $\gamma$ = 0.99 and update the target networks every 200 episodes. We find that the optimal hyperparameters of the value-based algorithms are similar, since they share the same basic architecture and training paradigm; the settings for VDN are therefore the same as for QMIX.

References

[1] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, 2015.

[2] Tadashi Kozuno, Yunhao Tang, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting Peng's Q(λ) for modern reinforcement learning. arXiv preprint arXiv:2103.00107, 2021.

[3] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. MAVEN: multi-agent variational exploration. In NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 7611–7622, 2019.

[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[5] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937, 2016.

[6] Sylvie CW Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. Pomdps for robotic tasks with mixed observability. 5:4, 2009.

[7] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In ICML 2000, Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 759–766. Morgan Kaufmann, 2000.

[8] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorization for deep multi-agent reinforcement learning. In ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4292–4301, 2018.

[9] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: Expanding Monotonic Value Function Factorisation. arXiv preprint arXiv:2006.10800, 2020.

[10] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.

[11] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Hostallero, and Yung Yi. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 5887–5896, 2019.

[12] Jianyu Su, Stephen Adams, and Peter A. Beling. Value-Decomposition Multi-Agent Actor-Critics. arXiv:2007.12306, 2020.

[13] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv preprint arXiv:1706.05296, 2017.

[14] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

[15] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex Dueling Multi-Agent Q-Learning. arXiv:2008.01062, 2020.

[16] Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent Soft Q-Learning. arXiv preprint arXiv:1804.09817, 2018.

[17] Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv preprint arXiv:2002.03939, 2020.

[18] Ming Zhou, Jun Luo, and Julian Villella et al. Smarts: Scalable multi-agent reinforcement learning training school for autonomous driving, 2020.

[19] Rashid T, Samvelyan M, Schroeder de Witt C, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 2020, 21.

[20] Wen C, Yao X, Wang Y, et al. SMIX(λ): Enhancing centralized value functions for cooperative multi-agent reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(05): 7301-7308.

[21] Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 2012, 14(8): 2.

[22] Fedus W, Ramachandran P, Agarwal R, et al. Revisiting fundamentals of experience replay. International Conference on Machine Learning. PMLR, 2020: 3061-3071.

[23] Ota K, Oiki T, Jha D, et al. Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?. International Conference on Machine Learning. PMLR, 2020: 7424-7433.

[24] Ba L J, Caruana R. Do deep nets really need to be deep?. arXiv preprint arXiv:1312.6184, 2013.

[25] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[26] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.

[27] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning. International conference on machine learning. PMLR, 2016: 1995-2003.