<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      Implementations that Matter in Cooperative Multi-Agent Reinforcement Learning &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/2021/12/01/Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">Implementations that Matter in Cooperative Multi-Agent Reinforcement Learning</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#multi-agent"> multi-agent </a>
  
    <a class="content-tag" href="/tags/#reinforcement-learning"> reinforcement-learning </a>
  
    <a class="content-tag" href="/tags/#experimental-techniques"> experimental techniques </a>
  
    <a class="content-tag" href="/tags/#monotonicity"> monotonicity </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Anonymous</span>
  <blockquote>
  <p>Multi-agent reinforcement learning often borrows techniques that are known to bring large improvements in single-agent reinforcement learning. However, it is still unclear which of these techniques are complementary and can be fruitfully combined. This post examines several experimental techniques whose effect is contested in multi-agent settings and offers some advice for improving the performance of MARL algorithms.</p>
</blockquote>

<p>Given the powerful decision-making ability of reinforcement learning (RL), it is a general trend to apply it, together with some complementary methods, to multi-agent cooperative settings. However, these methods may not transfer universally. In this post, we look at techniques that have proved effective in RL but whose effect on QMIX [<a href="#8">8</a>], the representative baseline algorithm in multi-agent reinforcement learning (MARL), is controversial or ambiguous. These inconsistencies also lead us to explore the best performance QMIX can reach on the StarCraft Multi-Agent Challenge (SMAC) [<a href="#10">10</a>], especially in the hardest scenarios. Furthermore, we briefly explore the role of the monotonicity constraint in the mixing network and give our proposals for enhancing the performance of MARL algorithms under the CTDE paradigm. We also aim to spur discussion about what matters in cooperative multi-agent scenarios, including reproducibility and comparability in MARL. We open-source the code at https://github.com/xxxx/xxxx (Anonymous) so that researchers can evaluate the effects of these techniques and make fair comparisons between algorithms.</p>

<ul>
  <li><a href="#From_RL_to_MARL">From RL to MARL</a></li>
  <li><a href="#QMIX_and_Monotonicity_Constraint">QMIX and Monotonicity Constraint</a></li>
  <li><a href="#Extension_to_QMIX">Extension to QMIX</a>
    <ul>
      <li><a href="#Experimental_Design">Experimental Design</a></li>
      <li><a href="#Optimizer">Optimizer</a></li>
      <li><a href="#Rollout_Process_Number">Rollout Process Number</a></li>
      <li><a href="#Replay_Buffer_Size">Replay Buffer Size</a></li>
      <li><a href="#Eligibility_Traces">Eligibility Traces</a></li>
      <li><a href="#Hidden_Size">Hidden Size</a></li>
      <li><a href="#Exploration_Steps">Exploration Steps</a></li>
    </ul>
  </li>
  <li><a href="#Integrating_the_Techniques">Integrating the Techniques</a></li>
  <li><a href="#Role_of_Monotonicity_Constraint">Role of Monotonicity Constraint</a>
    <ul>
      <li><a href="#Amazing_Performance_in_Policy-Based_Methods">Amazing Performance in Policy-Based Methods</a></li>
      <li><a href="#What_is_Under_the_Hood">What is Under the Hood?</a></li>
    </ul>
  </li>
  <li><a href="#Reproducibility_and_Fairness">Reproducibility and Fairness</a></li>
  <li><a href="#Appendix">Appendix</a></li>
  <li><a href="#Reference">Reference</a></li>
</ul>

<h1 id="from-rl-to-marl"><a name="From_RL_to_MARL">From RL to MARL</a></h1>
<p>Ever since AlphaGo beat humans at Go, RL has been a consistent hot spot in both academia and industry. An RL agent obtains rewards by interacting with the environment and takes actions to maximize the cumulative reward. Almost all RL problems can be described as <strong>Markov Decision Processes</strong>, as illustrated in Figure <a href="#mdp">1</a>.</p>

<p><a name="mdp"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/mdp.png" height="200px" style="margin: 0 auto;" /> </a></p>
<center>Figure 1: The agent-environment interaction in a Markov decision process. (Image source: Sec. 3.1 of Sutton &amp; Barto (2017) <a href="#14">[14]</a>).</center>
<p><br /></p>

<p>As its name implies, MARL involves multiple agents trained by RL algorithms in the same environment. Many complex multi-agent systems, such as robot swarm control, autonomous vehicle coordination, and sensor networks, can be modeled as MARL tasks. Through their interaction, the agents learn to work together to achieve a common goal.</p>

<div style="display:flex; margin:20px 0; gap:5px"><a name="chase"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/chase.gif" height="200px" style="margin: 0 auto;" /> </a>
<a name="magent"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/magent.gif" height="200px" style="margin: 0 auto;" /> </a>
<a name="hide"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/hide.gif" height="200px" style="margin: 0 auto;" /> </a>
<a name="smac"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/smac.gif" height="200px" style="margin: 0 auto;" /> </a></div>

<div style="margin-bottom: 20px"><center>Figure 2: Some multi-agent cooperative scenarios [left-to-right].
<a href="https://github.com/openai/multiagent-particle-envs"> (a) Chasing in Multi-Agent Particle Environment (Predator-Prey); </a><br />
<a href="https://github.com/geek-ai/MAgent"> (b) MAgent Environment; </a>
<a href="https://openai.com/blog/emergent-tool-use"> (c) Hide &amp; Seek; </a>
<a href="https://github.com/oxwhirl/smac"> (d) StarCraft Multi-Agent Challenge. </a></center></div>

<p>In practice, agents usually have a limited sight range for observing their surroundings. In the example shown in Figure <a href="#smac_obs">3</a>, the cyan border indicates the sight and shooting range of the agent, which means the agent can only obtain information about terrain or other agents within that range. Such multi-agent tasks can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [<a href="#6">6</a>], and the ultimate goal is to find a joint policy $\boldsymbol{\pi} = \langle \pi_{1},\ldots,\pi_{n}\rangle$ that maximizes the global reward.</p>

<div style="float:left; margin-right :40px"><a name="smac_obs"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/smac_agent_obs.jpg" height="300px" /> </a>
<center>Figure 3: The partial observation of agents<br />(Image source: SMAC <a href="#10">[10]</a>). </center><br /></div>

<p>The main challenges that stand between MARL and practical applications include inherent communication constraints, partial observability, and the <em>Non-Stationarity</em> resulting from the changing policies of other agents. These challenges make it hard for agents to cooperate well and lead to unstable learning. A setting known as <em>Centralized Training with Decentralized Execution</em>  (CTDE) [<a href="#15">15</a>] has been proposed to meet these challenges. It trains the policies in a centralized way, with access to the global state $s$ and the local action-observation histories of all agents. During execution, however, each agent can only make its own decision based on its local action-observation history $\tau^{i}$. Learning a shared centralized value function for all agents alleviates the non-stationarity in training. Among the algorithms that combine the agents’ individual $Q_{i}$, QMIX is the representative and effective method.<br /><br /></p>

<h1 id="qmix-and-monotonicity-constraint"><a name="QMIX_and_Monotonicity_Constraint">QMIX and Monotonicity Constraint</a></h1>

<p>To handle the relationship between the individual agents and the cooperative group, QMIX [<a href="#8">8</a>] learns a joint action-value function $Q_{tot}$ and factorizes the joint policy into the individual policies of the agents. In other words, as illustrated in Figure <a href="#frame">4</a>, QMIX combines all the individual $Q_{i}$ with a mixing network to obtain a centralized value function $Q_{tot}$, which can be updated more appropriately by the global reward.</p>

<p><a name="frame"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/qmix_frame.png" height="320px" style="margin: 0 auto;" /> </a></p>
<center>Figure 4: Framework of QMIX. (Image source: QMIX <a href="#8">[8]</a>) </center>
<p><br /></p>

<p>Formally, QMIX can be written as Eq. (\ref{eq1}):</p>

\[\begin{aligned}
Q_{tot}(s, \boldsymbol{u} ; \boldsymbol{\theta}, \phi)
&amp;= g_{\phi}\left(s, Q_{1}\left(\tau^{1}, u^{1} ; \theta^{1}\right), \ldots, Q_{N}\left(\tau^{N}, u^{N} ;  \theta^{N}\right)\right), \\
\frac{\partial Q_{tot}(s, \boldsymbol{u} ; \boldsymbol{\theta}, \phi)}{\partial Q_{i}\left(\tau^{i}, u^{i}; \theta^{i}\right)} &amp;\geq 0, \quad \forall i \in \mathcal{N}
\end{aligned} \tag{1} \label{eq1}\]

<p>where $\theta^i$ denotes the parameters of agent network $i$, and $\phi$ denotes the trainable parameters of the mixing network, which is responsible for factorizing $Q_{tot}$ into each agent’s $Q_{i}$. The <em>Monotonicity Constraint</em> is implemented in the mixing network, whose <em>hypernetwork</em> takes the global state $s$ as input and outputs nonnegative weights. This design ensures consistency between the joint action and the individual actions of the agents, and thereby guarantees the Individual-Global-Max (IGM) principle. Thanks to the monotonicity constraint in Eq. (\ref{eq1}), maximizing the joint $Q_{tot}$ is equivalent to maximizing each individual $Q_i$, so the optimal individual actions are consistent with the optimal joint action. Furthermore, QMIX learns the centralized value function $Q_{tot}$ by sampling a batch of transitions from the replay buffer and minimizing the mean squared temporal-difference (TD) error loss:</p>

\[\mathcal{L}(\theta)= \frac{1}{2} \sum_{i=1}^{b}\left[\left(y_{i}-Q_{tot}(s, u ; \theta, \phi)\right)^{2}\right] \tag{2} \label{eq2}\]

<p>where the TD target value $y=r+\gamma \underset{u^{\prime}}{\operatorname{max}} Q_{tot}(s^{\prime},u^{\prime};\theta^{-},\phi^{-})$, and $\theta^{-}, \phi^{-}$ are the target network parameters copied periodically from the current network and kept constant for a number of iterations.</p>
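
<p>To make the link between the monotonicity constraint in Eq. (\ref{eq1}) and the IGM principle concrete, here is a minimal Python sketch; it is not the QMIX implementation. It assumes a single-layer mixer whose weights are made nonnegative with an absolute value, and the random utilities, weights, and bias are hypothetical stand-ins for the outputs of the agent networks and the hypernetwork at one fixed state. It checks by brute force that the per-agent greedy actions attain the maximum of $Q_{tot}$.</p>

```python
import itertools
import random


def mix(agent_qs, weights, bias):
    """Simplified monotonic mixer: weighted sum with non-negative weights."""
    # abs() keeps every partial derivative dQ_tot/dQ_i >= 0, as Eq. (1) requires.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_action(qs):
    """Index of the largest individual utility."""
    return max(range(len(qs)), key=lambda a: qs[a])


rng = random.Random(0)
n_agents, n_actions = 3, 4

# Hypothetical individual utilities Q_i(tau_i, u_i) at one time step.
q_table = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)] for _ in range(n_agents)]
# Stand-ins for hypernetwork outputs at a fixed state s (may be negative before abs).
weights = [rng.uniform(-2.0, 2.0) for _ in range(n_agents)]
bias = rng.uniform(-1.0, 1.0)


def q_tot(joint_action):
    """Q_tot of a joint action under the simplified mixer."""
    return mix([q_table[i][a] for i, a in enumerate(joint_action)], weights, bias)


# Decentralized execution: every agent maximizes its own Q_i.
greedy_joint = tuple(greedy_action(qs) for qs in q_table)
# Centralized check: exhaustive search over all joint actions.
best_joint = max(itertools.product(range(n_actions), repeat=n_agents), key=q_tot)

print("decentralized greedy:", greedy_joint, "centralized argmax:", best_joint)
print("same Q_tot:", q_tot(greedy_joint) == q_tot(best_joint))
```

<p>Because every weight is nonnegative, raising any $Q_i$ can never lower $Q_{tot}$, so the decentralized greedy choice reaches the same value as the centralized search, without enumerating the joint action space.</p>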

<p>Unsurprisingly, many variants of QMIX have since been developed, which aim to relax the monotonicity constraint or learn a more stable and generalizable centralized value function. As a pioneer, Value-Decomposition Network (VDN) [<a href="#13">13</a>] only requires a linear decomposition $Q_{tot} = \sum_{i}^{N} Q_i$, which is an additive special case of the monotonic factorization. Qatten [<a href="#17">17</a>] introduces an attention mechanism to determine the weight of each agent based on its observations. QTRAN [<a href="#11">11</a>] learns the discrepancy between $\sum_{i}^{N} Q_i$ and $Q_{tot}$, which factorizes the centralized critic function and trains all the agents end-to-end. QPLEX [<a href="#15">15</a>] transfers the monotonicity constraint from Q values to advantage values [<a href="#27">27</a>] and introduces a duplex transformed network to integrate the state information. WQMIX [<a href="#9">9</a>] scales down the estimated centralized value of non-optimal joint actions and further relaxes the monotonicity constraint with a true value network and some theoretical constraints. SMIX [<a href="#20">20</a>] enhances QMIX by incorporating a lite SARSA($\lambda$) in the centralized critic, and MAVEN [<a href="#3">3</a>] introduces <em>committed exploration</em> to persist joint exploratory policies for all the agents over an entire episode. VMIX [<a href="#12">12</a>] combines the Advantage Actor-Critic (A2C) [<a href="#5">5</a>] with QMIX to extend the monotonicity constraint to critic networks.</p>

<p>Since all of these later algorithms report performance exceeding QMIX in SMAC, one question has been plaguing us: does QMIX perform below expectations because of improper training parameters or techniques? We wish to know which techniques affect the performance of QMIX, or even of other cooperative MARL algorithms.</p>

<h1 id="extension-to-qmix"><a name="Extension_to_QMIX">Extension to QMIX</a></h1>
<h2 id="experimental-design"><a name="Experimental_Design">Experimental Design</a></h2>
<p>To study which techniques affect the training effectiveness and sample efficiency of QMIX, we perform a set of experiments designed to provide insight into methods that have proved effective in single-agent RL but whose effect in MARL may be ambiguous. In particular, we investigate the effects of: <strong>the Adam optimizer with parallel rollout processes; the replay buffer size; the number of parallel rollout processes; the number of $\epsilon$-exploration steps; the use of $Q(\lambda)$ in the centralized value function; and the hidden size of the agents’ recurrent network. We also study the role of the monotonicity constraint in QMIX.</strong> For all experiments, we use the PyMARL [<a href="#10">10</a>] framework to implement QMIX and its variants. To ensure fairness, we run five independent experimental trials for each evaluation, each with its own random seed. Unless otherwise mentioned, we use the default settings of PyMARL whenever possible, while incorporating the techniques of interest. All results are plotted as the median with a shaded interval.</p>

<p><strong>StarCraft Multi-Agent Challenge (SMAC)</strong> As a commonly used testing environment, SMAC [<a href="#10">10</a>] offers a great opportunity to tackle cooperative control problems in the multi-agent domain. We focus on the micromanagement challenge in SMAC, in which each unit is controlled by an independent agent that conditions on a limited observation area, and the group of units is trained to defeat an enemy force controlled by the built-in AI. According to the number and type of enemy units, the testing scenarios are divided into <em>Easy, Hard</em>, and <em>Super-Hard</em> levels. Since QMIX can effectively solve the <em>Easy</em> tasks, we focus on some <em>Hard</em> and <em>Super-Hard</em> scenarios that QMIX fails to win, especially <em>Corridor, 3s5z_vs_3s6z</em>, and <em>6h_vs_8z</em>.</p>

<p><strong>Predator-Prey (PP)</strong>  is representative of another classical problem called <em>relative overgeneralization</em> [<a href="#16">16</a>]. The cooperating predators are trained to chase a faster prey and to capture it in as few steps as possible. We use two difficulty-enhanced variants of the Predator-Prey environment to test the algorithms: (1) Predator-Prey-1 (PP-1) requires two predators to catch the prey at the same time to get a reward; (2) Predator-Prey-2 (PP-2) replaces the policy of the prey with a hard-coded heuristic that moves the prey to the sampled position farthest from the predator. Both environments require greater cooperation between agents.</p>

<h2 id="optimizer"><a name="Optimizer">Optimizer</a></h2>
<p>The choice of optimizer is an important part of training neural networks, since it can seriously affect how well a reinforcement learning agent trains. Without further explanation, QMIX and its variants use RMSProp [<a href="#21">21</a>] to optimize the agents’ neural networks, as it has proved stable in SMAC. Adam [<a href="#1">1</a>], on the other hand, is known for fast convergence thanks to its momentum and is often the first choice of AI researchers. We reckon that the momentum in Adam could be advantageous for fitting the data that agents sample by interacting with the environment, as in MARL. In addition, QMIX has been criticized for sub-optimal performance and sample inefficiency when equipped with the A2C framework, which is meant to improve the training efficiency of RL algorithms. VMIX [<a href="#12">12</a>] argues that this limitation is caused by the value-based Q function, so it extends QMIX to an actor-critic style algorithm to take advantage of the A2C framework. This controversy leads us to evaluate the performance of QMIX with Adam, as well as with the parallel sampling paradigm.</p>
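
<p>For readers who want to see where the momentum term enters, the sketch below writes out the textbook RMSProp and Adam update rules for a single scalar parameter in plain Python. It is only an illustration under assumed hyperparameters (the learning rate, decay rates, and the toy objective $f(x) = x^2$ are all hypothetical choices) and is not the optimizer code used in PyMARL.</p>

```python
import math


def rmsprop_step(param, grad, state, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSProp update: scale the gradient by a running RMS of past gradients."""
    state["sq"] = alpha * state["sq"] + (1 - alpha) * grad * grad
    return param - lr * grad / (math.sqrt(state["sq"]) + eps)


def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: RMS scaling plus a momentum (first-moment) estimate."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # momentum term
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize(step_fn, state, steps=200):
    """Run an optimizer on the toy objective f(x) = x^2, starting from x = 1."""
    x = 1.0
    for _ in range(steps):
        x = step_fn(x, 2.0 * x, state)  # the gradient of x^2 is 2x
    return x


print("RMSProp:", minimize(rmsprop_step, {"sq": 0.0}))
print("Adam:   ", minimize(adam_step, {"t": 0, "m": 0.0, "v": 0.0}))
```

<p>The only structural difference between the two rules is the first-moment estimate in Adam, which averages recent gradients before the step is taken.</p>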

<p><a name="optimizer"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/optimizer.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 5: The Q networks optimized by Adam and RMSProp.</center>
<p><br /></p>

<p><strong>Results</strong> As shown in Figure <a href="#optimizer">5</a>, we run QMIX with Adam and <strong>8 rollout processes</strong>. In contrast to what was described in VMIX, Adam greatly improves the performance and efficiency of QMIX. We speculate that the momentum in Adam quickly fits the newly sampled data from the parallel rollout processes and thus enhances performance, whereas RMSProp fails to do so. Hence the limitation reported by VMIX is most likely due to an improper choice of optimizer. Adam remains an important consideration in MARL.</p>

<h2 id="rollout-process-number"><a name="Rollout_Process_Number">Rollout Process Number</a></h2>
<p>Naturally, we next focus on the benefits of parallel data sampling in QMIX. A2C [<a href="#5">5</a>] provides an excellent example of reducing training time and improving training efficiency in single-agent RL. When we implement algorithms under the A2C paradigm, the total number of samples is usually fixed while the number of rollout processes is left unspecified. The total number of samples $S$ can be calculated as $S = E \cdot P \cdot I$, where $E$ is the number of samples in each episode, $P$ is the number of parallel rollout processes, and $I$ is the number of policy iterations. This section analyzes, and aims to spur discussion on, the impact of parallel rollout processes on the final performance of QMIX.</p>
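
<p>The relation $S = E \cdot P \cdot I$ can be turned into a short calculation. The sketch below assumes a hypothetical sample budget and episode length (neither number is taken from our experiments) and shows how the number of policy iterations shrinks as more rollout processes share the same budget.</p>

```python
def policy_iterations(total_samples, samples_per_episode, n_processes):
    """Solve S = E * P * I for I, the number of policy iterations."""
    return total_samples // (samples_per_episode * n_processes)


S = 10_000_000  # hypothetical total sample budget
E = 100         # hypothetical number of samples per episode

for P in (1, 2, 4, 8):
    print(f"P = {P}: I = {policy_iterations(S, E, P)}")
```

<p>With $S$ and $E$ fixed, doubling $P$ halves $I$, which is the trade-off examined in this section.</p>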

<p><a name="process_number">  <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/process_number.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 6: Given the total number of samples, fewer processes achieve better performance.</center>
<p><br /></p>

<p><strong>Results</strong> We again use QMIX with Adam to evaluate the effect of the number of rollout processes. Since PyMARL offers a <em>Parallel</em> mode for sampling the agents’ interactions with the environment, we can in theory obtain more <strong>on-policy</strong> data that is close to the policy being updated. Figure <a href="#process_number">6</a> shows that, for a given $S$, the performance of QMIX does not improve consistently as the number of rollout processes $P$ increases. The intuitive explanation is that with fewer rollout processes, the policy is iterated more times [<a href="#14">14</a>]. Besides, data that is refreshed too quickly in parallel may make policy updates unstable, i.e., it is difficult for agents to learn useful information from rapidly changing data sampled from the replay buffer. The more times the policy is iterated, the more information the agents learn, which increases performance. However, it also causes longer training time and a loss of stability. We suggest starting with fewer rollout processes and then balancing training time against performance.</p>

<h2 id="replay-buffer-size"><a name="Replay_Buffer_Size">Replay Buffer Size</a></h2>
<p>The replay buffer plays an important role in improving sample efficiency in off-policy single-agent RL, and its capacity greatly affects the performance and stability of algorithms. Researchers usually give the replay buffer of a Deep Q-network (DQN) [<a href="#4">4</a>] a very large capacity to stabilize training. The effect of the replay buffer in single-agent RL has already been studied in [<a href="#22">22</a>], which suggests that the distribution of the sampled training data should be as close as possible to the policy being updated. Two factors are affected when we change the replay buffer: (1) the replay capacity (the total number of transitions/episodes stored in the buffer); and (2) the replay ratio (the number of gradient updates per environment transition/episode). When we increase the capacity of the replay buffer while the replay ratio stays fixed, the share of aged experiences from old policies grows. The distribution of these outdated experiences then differs more from the policy being updated, which makes training harder. The results in [<a href="#22">22</a>] suggest that there is an optimal range for the combination of replay buffer size and replay ratio in RL, and we would like to know whether the same holds in MARL.</p>
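
<p>The effect of capacity on the age of sampled experience can be illustrated with a toy first-in-first-out episode buffer. This is a sketch under simplifying assumptions, not the PyMARL buffer: the class and function names are hypothetical, and we assume the policy is updated once per collected episode, so that the age of an episode is the number of policy versions between its collection and the latest policy.</p>

```python
import random
from collections import deque


class EpisodeReplayBuffer:
    """Toy FIFO buffer: the oldest episode is dropped once capacity is reached."""

    def __init__(self, capacity):
        self.episodes = deque(maxlen=capacity)

    def add(self, episode):
        self.episodes.append(episode)

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement over the stored episodes.
        return rng.sample(list(self.episodes), batch_size)


def mean_sample_age(capacity, n_collected=20_000, batch_size=32, seed=0):
    """Average age (in policy versions) of one uniformly sampled batch."""
    rng = random.Random(seed)
    buffer = EpisodeReplayBuffer(capacity)
    for policy_version in range(n_collected):
        buffer.add({"policy_version": policy_version})
    latest = n_collected - 1
    batch = buffer.sample(batch_size, rng)
    return sum(latest - ep["policy_version"] for ep in batch) / batch_size


for capacity in (1_000, 5_000, 20_000):
    print(f"capacity {capacity}: mean age of a batch = {mean_sample_age(capacity):.0f}")
```

<p>Under uniform sampling, a larger buffer returns batches that were on average collected by older policies, which is the mismatch discussed above.</p>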

<p><a name="replay_buffer"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/buffer_size.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 7:  Setting the replay buffer size to 5000 episodes allows for QMIX’s learning to be stable.</center>
<p><br /></p>

<p><strong>Results</strong> The results do not seem consistent with those in single-agent RL. Figure <a href="#replay_buffer">7</a> shows that a large replay buffer makes QMIX unstable during training. When we increase the buffer size from the default setting in PyMARL, the performance declines almost continuously. We speculate that, because of the enormous joint action space, the fast-changing distribution of experiences in a larger buffer makes the sampled data harder to fit. Since the samples become obsolete more quickly, the old policies that produced them also differ more from the policy being updated, which brings additional difficulty. On the other hand, we find the same performance decline when we shrink the buffer. We reckon that an insufficient buffer implicitly speeds up the turnover of the sampled data, which makes it tough to fit the data and learn a good policy. We believe the default replay buffer size in QMIX is satisfactory in this framework, and researchers should be cautious about increasing the buffer size in other multi-agent applications.</p>

<h2 id="eligibility-traces"><a name="Eligibility_Traces">Eligibility Traces</a></h2>
<p>The well-known trade-off between bias and variance in bootstrapping is a classic research topic in RL. Since we use the Centralized Value Function (CVF) to alleviate the <em>Non-Stationarity</em> in multi-agent settings, the accuracy of the CVF estimate is critical in MARL, because it guides the updates of the agents’ policies. Eligibility traces such as TD($\lambda$)[<a href="#14">14</a>], Peng’s Q($\lambda$)[<a href="#2">2</a>], and TB($\lambda$)[<a href="#7">7</a>] strike a balance between return-based algorithms (where the return is the sum of discounted rewards $\sum_{t} \gamma^{t} r_{t}$) and bootstrap algorithms (where the return is $r_t + \gamma V(s_{t+1})$), and thereby speed up the convergence of the agents’ policies. As a pioneer, SMIX [<a href="#20">20</a>] equipped QMIX with SARSA($\lambda$) to estimate the CVF more accurately and obtained decent performance. As another example of eligibility traces in Q-learning, we study estimating the CVF with Peng’s Q$(\lambda)$ for QMIX.</p>
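
<p>One common way to compute Peng’s Q$(\lambda)$ targets is a backward recursion over an episode, $G_t = r_t + \gamma\left[(1-\lambda)\max_{u} Q(s_{t+1}, u) + \lambda G_{t+1}\right]$. The sketch below is a minimal single-episode version in plain Python; the function and argument names are hypothetical, and the bootstrap values are assumed to come from a target network, with 0 at a terminal step.</p>

```python
def peng_q_lambda_targets(rewards, next_max_q, gamma, lam):
    """Backward recursion for lambda-returns that bootstrap from max_u Q.

    rewards[t]    : reward r_t received at step t
    next_max_q[t] : max_u Q_tot(s_{t+1}, u) from the target network
                    (use 0.0 when s_{t+1} is terminal)
    """
    targets = [0.0] * len(rewards)
    # Past the end of the episode only the bootstrap value is available.
    next_return = next_max_q[-1]
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_max_q[t] + lam * next_return
        targets[t] = rewards[t] + gamma * blended
        next_return = targets[t]
    return targets


rewards = [1.0, 1.0, 1.0]
next_max_q = [0.5, 0.5, 0.0]  # the last transition ends the episode

for lam in (0.0, 0.5, 1.0):
    print(lam, peng_q_lambda_targets(rewards, next_max_q, gamma=1.0, lam=lam))
```

<p>Setting $\lambda = 0$ recovers the one-step TD target, while $\lambda = 1$ yields the full return of the episode; intermediate values interpolate between the two.</p>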

<p><a name="qlambda1">  <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/td_lambda.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 8:  Q(λ)  significantly improves performance of QMIX, but large values of λ lead to instability in the algorithm.</center>
<p><br /></p>

<p><strong>Results</strong> As in single-agent RL, Q-networks without sufficient training usually have a large bias in their bootstrapped returns. Figure <a href="#qlambda1">8</a> shows that, with the help of Q$(\lambda)$, the performance of QMIX generally improves across all scenarios. This means that a more accurate estimate of the CVF still provides a better direction for updating the policy of each agent. However, $\lambda$ in Peng’s Q$(\lambda)$ cannot be set as aggressively as in single-agent RL: large values lead to failed convergence because of the high variance. We recommend considering a value of around $\lambda=0.5$ when using $Q(\lambda)$ in MARL.</p>

<h2 id="hidden-size"><a name="Hidden_Size">Hidden Size</a></h2>
<p>Searching for the optimal scale and architecture of a neural network is a tough problem in machine learning. In deep reinforcement learning, researchers typically train agents with empirically chosen small networks. Since the role of the neural network is to extract features from the input states and actions, its size also has a great impact on the performance of MARL algorithms. The study in [<a href="#23">23</a>] revealed that networks with a complex structure such as ResNet[<a href="#25">25</a>] and DenseNet[<a href="#26">26</a>] can extract more useful information for training, while Ba [<a href="#24">24</a>] argues that the width of a neural network is probably more important than its depth. A subsequent study on QMIX [<a href="#19">19</a>] did preliminary research on the depth of neural networks and showed a limited improvement in performance. However, there is little research on the width of neural networks in MARL. Instead of searching for an optimal network architecture, we conduct a pilot study on the effect of the hidden size, i.e., the network width, in QMIX.</p>

<p><a name="hiddensize"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/hidden_size.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 9: Impact of the hidden size of the network in QMIX.</center>
<p><br /></p>

<p><strong>Results</strong> The study in [<a href="#24">24</a>] illustrates that infinitely wide networks can fit any complex function, which theoretically supports a performance gain from increasing network width. As shown in Figure <a href="#hiddensize">9</a>, increasing the hidden size from 64 to 256 in QMIX improves the final performance or the training efficiency to varying degrees, where <strong>QMIX-ALL-Hidden</strong> changes the hidden size of both the RNN and the mixing network, while <strong>QMIX-RNN-Hidden</strong> changes only the RNN. The results also reveal a striking effect of widening the RNN, which yields an increase of about 20% in the Super-Hard scenario <em>3s5z_vs_3s6z</em>, whereas enlarging the mixing network brings only limited improvement. We speculate that the RNN needs more units to represent complex temporal context information, which the mixing network does not have to handle. We advise researchers to increase the RNN width appropriately to achieve better performance.</p>

<h2 id="exploration-steps"><a name="Exploration_Steps">Exploration Steps</a></h2>
<p>The trade-off between exploration and exploitation is another classic problem in reinforcement learning. Agents need some directed mechanism to explore states that may be of higher value or have not been experienced yet. The most widely used exploration method in RL is $\epsilon$-greedy action selection, in which the agent selects a random action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. The value of $\epsilon$ is annealed during training and then stays at a small constant. This exploration mechanism is usually applied independently to each agent's action selection, which MAVEN [<a href="#3">3</a>] criticizes for lacking a joint exploratory policy over an entire episode. Nevertheless, we can still get more exploration by letting $\epsilon$ decay more slowly, so we evaluate the effect of the annealing period of $\epsilon$-greedy in some Super-Hard scenarios in SMAC.</p>

<p><a name="exploration">  <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/exploration.png" height="210px" style="margin: 0 auto;" /> </a></p>
<center>Figure 10: Experiments for the ε anneal period.</center>
<p><br /></p>

<p><strong>Results</strong> Increasing the annealing period of $\epsilon$-greedy appropriately, from 100K steps to 500K, yields a clear performance gain in those hard exploration scenarios where QMIX fails with the default setting. However, as shown in Figure <a href="#exploration">10</a>, an annealing period that is too long, such as 1000K steps, introduces additional exploration noise and can even make training collapse. These results confirm that $\epsilon$-greedy is still a suitable and simple choice in MARL, but it should be carefully tuned for different tasks.</p>
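<p>The annealing period discussed above can be sketched as a linear schedule combined with per-agent $\epsilon$-greedy selection. This is an illustrative sketch only: the function names and the start and end values of $\epsilon$ (1.0 and 0.05) are assumptions and are not stated in this post; only the anneal lengths (100K, 500K, 1000K) come from the experiments.</p>

```python
import random


def linear_epsilon(step, anneal_steps=500_000, eps_start=1.0, eps_finish=0.05):
    """Linearly anneal epsilon over `anneal_steps`, then hold it constant.

    eps_start and eps_finish are assumed values for illustration.
    """
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_finish - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice for a single agent over its own Q-values."""
    if rng.random() < epsilon:
        # Explore: uniformly random action index.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

<p>A longer <code>anneal_steps</code> keeps $\epsilon$ high for more environment steps, which is the knob varied in Figure 10.</p>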

<h2 id="integrating-the-techniques"><a name="Integrating_the_Techniques">Integrating the Techniques</a></h2>
<p>The techniques above do affect QMIX in the hard cooperative scenarios of SMAC, which motivates us to push the performance of QMIX to its limit. We combine these techniques and finetune all the hyperparameters of QMIX for each SMAC scenario. As shown in Table <a href="#table1">1</a>, Finetuned-QMIX conquers almost all the scenarios in SMAC and exceeds the original QMIX by a large margin in some Hard and Super-Hard scenarios.</p>

<p><a name="table1"> </a></p>
<center>
    Table 1: Best median test win rate of Finetuned-QMIX and QMIX (batch size=128) in all scenarios.
</center>
<table style="text-align: center; width: 600px; margin: 0 auto; margin-bottom:20px; margin-top:20px">
  <thead>
    <tr>
      <th>Scenarios</th>
      <th>Difficulty</th>
      <th>QMIX</th>
      <th>Finetuned-QMIX</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>10m_vs_11m</td>
      <td>Easy</td>
      <td>98%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>8m_vs_9m</td>
      <td>Hard</td>
      <td>84%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>5m_vs_6m</td>
      <td>Hard</td>
      <td>84%</td>
      <th>90%</th>
    </tr>
    <tr>
      <td>3s_vs_5z</td>
      <td>Hard</td>
      <td>96%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>bane_vs_bane</td>
      <td>Hard</td>
      <th>100%</th>
      <th>100%</th>
    </tr>
    <tr>
      <td>2c_vs_64zg</td>
      <td>Hard</td>
      <th>100%</th>
      <th>100%</th>
    </tr>
    <tr>
      <td>corridor</td>
      <td>Super hard</td>
      <td>0%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>MMM2</td>
      <td>Super hard</td>
      <td>98%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>3s5z_vs_3s6z</td>
      <td>Super hard</td>
      <td>3%</td>
      <th>93% (Hidden Size = 256)</th>
    </tr>
    <tr>
      <td>27m_vs_30m</td>
      <td>Super hard</td>
      <td>56%</td>
      <th>100%</th>
    </tr>
    <tr>
      <td>6h_vs_8z</td>
      <td>Super hard</td>
      <td>0%</td>
      <th>93% (λ = 0.3)</th>
    </tr>
  </tbody>
</table>

<p>Besides, we are curious to see how much these techniques improve the performance of algorithms subsequently proposed on top of QMIX. We therefore normalize the techniques across all these algorithms, i.e., we perform the same grid search on a typical Hard scenario (<em>5m_vs_6m</em>) and a Super-Hard scenario (<em>3s5z_vs_3s6z</em>) to find <strong>a general set of</strong> hyperparameters for each method. As shown in Table <a href="#table2">2</a>, QMIX still conquers the Super-Hard tasks and surpasses the other variants in most scenarios. In general, the variants of QMIX [<a href="#9">9</a>; <a href="#11">11</a>; <a href="#13">13</a>] that aim to relax the <em>monotonicity constraint</em> do not outperform QMIX. This suggests that QMIX is more than just a baseline algorithm in cooperative scenarios.</p>

<p><a name="table2"> </a></p>
<center>
    Table 2: Median test-winning rate (or episode return) of MARL algorithms with normalized techniques. S-Hard denotes the Super-Hard level. We compare their performance in the most difficult scenarios of SMAC and Predator-Prey-1.
</center>
<table style="text-align: center; width: 900px; margin: 0 auto; margin-bottom:20px; margin-top:20px">
<thead>
  <tr>
    <th rowspan="2">Scenarios</th>
    <th rowspan="2">Difficulty</th>
    <th colspan="7">Algorithm</th>
  </tr>
  <tr>
    <th>QMIX</th>
    <th>VDN</th>
    <th>Qatten</th>
    <th>QPLEX</th>
    <th>WQMIX</th>
    <th>VMIX</th>
    <th>AC-MIX</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>2c_vs_64zg</td>
    <td>Hard</td>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <td>98%</td>
    <th>100%</th>
  </tr>
  <tr>
    <td>8m_vs_9m</td>
    <td>Hard</td>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <td>95%</td>
    <td>95%</td>
    <td>75%</td>
    <td>95%</td>
  </tr>
  <tr>
    <td>3s_vs_5z</td>
    <td>Hard</td>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>96%</th>
    <th>96%</th>
  </tr>
  <tr>
    <td>5m_vs_6m</td>
    <td>Hard</td>
    <th>90%</th>
    <th>90%</th>
    <th>90%</th>
    <th>90%</th>
    <th>90%</th>
    <td>9%</td>
    <td>67%</td>
  </tr>
  <tr>
    <td>3s5z_vs_3s6z</td>
    <td>S-Hard</td>
    <th>75%</th>
    <td>43%</td>
    <td>62%</td>
    <td>68%</td>
    <td>56%</td>
    <td>56%</td>
    <th>75%</th>
  </tr>
  <tr>
    <td>Corridor</td>
    <td>S-Hard</td>
    <th>100%</th>
    <td>98%</td>
    <th>100%</th>
    <td>96%</td>
    <td>96%</td>
    <td>0%</td>
    <th>100%</th>
  </tr>
  <tr>
    <td>6h_vs_8z</td>
    <td>S-Hard</td>
    <td>84%</td>
    <th>87%</th>
    <td>82%</td>
    <td>78%</td>
    <td>75%</td>
    <td>80%</td>
    <td>19%</td>
  </tr>
  <tr>
    <td>MMM2</td>
    <td>S-Hard</td>
    <th>100%</th>
    <td>96%</td>
    <th>100%</th>
    <th>100%</th>
    <td>96%</td>
    <td>70%</td>
    <th>100%</th>
  </tr>
  <tr>
    <td>27m_vs_30m</td>
    <td>S-Hard</td>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <th>100%</th>
    <td>93%</td>
    <td>93%</td>
  </tr>
  <tr>
    <td>Predator-Prey-1</td>
    <td>-</td>
    <th>40</th>
    <td>39</td>
    <td>-</td>
    <td>39</td>
    <td>39</td>
    <td>39</td>
    <td>38</td>
  </tr>
  <tr>
    <td>Avg. Score</td>
    <td>-</td>
    <th>94.9%</th>
    <td>91.2%</td>
    <td>92.7%</td>
    <td>92.5%</td>
    <td>90.5%</td>
    <td>67.4%</td>
    <td>84.0%</td>
  </tr>
</tbody>
</table>
<p><br /></p>

<h1 id="role-of-monotonicity-constraint"><a name="Role_of_Monotonicity_Constraint">Role of Monotonicity Constraint</a></h1>
<h2 id="amazing-performance-in-policy-based-methods"><a name="Amazing_Performance_in_Policy-Based_Methods">Amazing Performance in Policy-Based Methods</a></h2>
<p><a name="qmix_sy"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/riit.png" height="180px" style="margin: 0 auto;" />  </a></p>

<center>Figure 11: Architecture of AC-MIX. |·| denotes the absolute value operation, which implements the monotonicity constraint of QMIX. <b>W</b> denotes the non-negative mixing weights. Agent i denotes the policy network, which can be trained end-to-end by maximizing $Q_{tot}$.</center>
<p><br /></p>

<p>The novelty of QMIX is the IGM consistency between $\text{argmax} Q_{tot}$ and $\text{argmax} \sum_{i}^{N} Q_{i}$, which is enforced by the mixing network. We want to study the role of the <em>monotonicity constraint</em> in MARL further. Therefore, we propose an actor-critic style algorithm called Actor-Critic-Mixer (AC-MIX), which has a similar architecture to QMIX. As illustrated in Figure <a href="#qmix_sy">11</a>, we use the monotonic mixing network as a centralized critic, which integrates the $Q_{i}$ of each agent, to optimize the decentralized policy networks $\pi^{i}_{\theta_{i}}$ end-to-end. We also add the Adaptive Entropy <a href="#18">[18]</a> of each agent to the optimization objective in Eq. \ref{eq3} to encourage exploration; the details of the algorithm are described in Appendix <a href="#A">A</a>.</p>

\[\max _{\theta} \mathbb{E}_{t, s_{t}, \tau_{t}^{1}, \ldots, \tau_{t}^{n}}\left[Q_{\theta_{c}}^{\pi}\left(s_{t}, \pi_{\theta_{1}}^{1}\left(\cdot \mid \tau_{t}^{1}\right), \ldots, \pi_{\theta_{n}}^{n}\left(\cdot \mid \tau_{t}^{n}\right)\right)
+ \mathbb{E}_{i}\left[\mathcal{H}\left(\pi_{\theta_{i}}^{i}\left(\cdot \mid \tau_{t}^{i}\right)\right)\right]\right] \tag{3} \label{eq3}\]

<p><a name="riit_abla"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/monotonicity_riit.png" height="210px" style="margin: 0 auto; margin-top: 20px" /> </a></p>

<center>Figure 12: Comparing AC-MIX with and without the monotonicity constraint (i.e., with the absolute value operation removed) on SMAC and Predator-Prey-2.</center>
<p><br /></p>

<p>Since the critic of AC-MIX is not used for greedy action selection, the monotonicity constraint on it is theoretically no longer required. We can therefore evaluate the effect of the monotonicity constraint by removing the absolute value operation in the mixing network. The results in Figure <a href="#riit_abla">12</a> demonstrate that the <em>monotonicity constraint</em> significantly improves the performance of AC-MIX. To explore the generality of <em>monotonicity constraints</em> in the parallel sampling framework of MARL, we extend the above experiments to VMIX [<a href="#12">12</a>]. VMIX adds the monotonicity constraint to the value network of A2C and learns the policy of each agent by advantage-based policy gradient [<a href="#14">14</a>], as illustrated in Figure <a href="#vmix_net">13</a>. Again, the result in Figure <a href="#vmix_abla">14</a> shows that the monotonicity constraint improves the sample efficiency in value networks.</p>
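<p>The ablation described above amounts to switching the absolute value on the mixing weights on or off. Below is a minimal, dependency-free sketch of a two-layer mixer with that switch. It is a simplification made for illustration: in QMIX the weights and biases are produced by state-conditioned hyper-networks, whereas here they are passed in directly, and the function names are hypothetical.</p>

```python
import math


def elu(x):
    """ELU activation, which is itself monotonically increasing."""
    return x if x > 0 else math.exp(x) - 1.0


def mix(agent_qs, w1, b1, w2, b2, monotonic=True):
    """Combine per-agent values into Q_tot with a two-layer mixer.

    w1[i][k] maps agent i to hidden unit k; w2[k] maps hidden unit k to Q_tot.
    With monotonic=True the weights pass through |.|, so dQ_tot/dQ_i >= 0.
    """
    if monotonic:
        w1 = [[abs(w) for w in row] for row in w1]
        w2 = [abs(w) for w in w2]
    hidden = [
        elu(sum(q * w1[i][k] for i, q in enumerate(agent_qs)) + b1[k])
        for k in range(len(b1))
    ]
    return sum(h * w for h, w in zip(hidden, w2)) + b2
```

<p>With <code>monotonic=False</code> the raw (possibly negative) weights are used, so raising one agent's value can lower $Q_{tot}$; this corresponds to the "without monotonicity constraint" variant in the figures.</p>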

<p><a name="vmix_net"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/vmix.png" height="190px" style="margin: 0 auto;" /> </a></p>
<center>Figure 13: Architecture of VMIX. |·| denotes the absolute value operation.</center>
<p><br /></p>

<p><a name="vmix_abla"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/monotonicity_vmix.png" height="210px" style="margin: 0 auto;" /> </a></p>

<center>Figure 14: Comparing VMIX with and without the monotonicity constraint (i.e., with the absolute value operation removed) on SMAC.</center>
<p><br /></p>

<h2 id="what-is-under-the-hood"><a name="What_is_Under_the_Hood">What is Under the Hood?</a></h2>
<p>The previous experiments show that the <em>monotonicity constraints</em> in the mixing network do improve performance and sample efficiency, but on the flip side, QMIX is still criticized for the insufficient expressive capacity of its centralized critic. The most common experiment used to verify this is the <strong>Single-state Matrix Game</strong>, which contains only two agents with three actions each, and in which the agents need to find the optimal joint action $(A, A)$, as in Table <a href="#table3">3</a>. When we visualize the pay-offs of these matrices in Figure <a href="#original_version">15</a>, we find a deep “ditch” between the optimal and sub-optimal joint actions, which is a representative <em>Relative Overgeneralization</em> pathology in multi-agent tasks.</p>

<head>
	<meta charset="UTF-8" />
	<title>Table</title>
	<style type="text/css">

	.text{text-align:center;}
        .slash {
            width: 100px;
            height: 50px;
            background-color: #000000;
            position: relative;
            padding: 0 !important;
        }

        .slash::before {
            content: '';
            display: block;
            width: 100%;
            height: 100%;
            background-color: #FFFFFF;
            clip-path: polygon(0px 0.5px, 0px 100%, calc(100% - 0.5px) calc(100% + 0.5px));
            position: absolute;
            top: 0;
        }

        .slash::after {
            content: '';
            display: block;
            width: 100%;
            height: 100%;
            background-color: #FFFFFF;
            clip-path: polygon(100% calc(100% - 0.5px), 100% 0px, 0px -0.5px);
            position: absolute;
            top: 0;
        }
		.clearfix:after {
			content: '.';
			height: 0;
			display: block;
			clear: both;
		}
    </style>
</head>
<body>
	<a name="table3"> </a>
	<div style="text-align: center;">Table 3: Single-state Matrix Game</div>
	<div class="clearfix" style="text-align: center; margin-left: 100px;">
	<div style="float:left; margin: 50px;text-align:center; margin-top: 20px; margin-bottom:20px">
		<div>Table 3(a): Original version</div>
		<table border="1" cellspacing="0" width="400">
        <tr>
            <td class="slash">
                <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
                <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
            </td>
            <td>A</td>
            <td>B</td>
            <td>C</td>
        </tr>
        <tr>
		    <td>A</td>
		    <th>8</th>
		    <td>-12</td>
		    <td>-12</td>
		</tr>
		<tr>
		    <td>B</td>
		    <td>-12</td>
		    <td>0</td>
		    <td>0</td>
		</tr>
		<tr>
		    <td>C</td>
		    <td>-12</td>
		    <td>0</td>
		    <td>0</td>
		</tr>
    </table>
	</div>
    <div style="float:left; margin: 50px;text-align:center;margin-top: 20px; margin-bottom:20px">
		<div>Table 3(b): Hard version</div>
		<table border="1" cellspacing="0" width="400">
        <tr>
            <td class="slash">
                <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
                <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
            </td>
            <td>A</td>
            <td>B</td>
            <td>C</td>
        </tr>
        <tr>
		    <td>A</td>
		    <th>12</th>
		    <td>-15</td>
		    <td>-15</td>
		</tr>
		<tr>
		    <td>B</td>
		    <td>-15</td>
		    <td>9</td>
		    <td>9</td>
		</tr>
		<tr>
		    <td>C</td>
		    <td>-15</td>
		    <td>9</td>
		    <td>9</td>
		</tr>
    </table>
	</div>

	<div style="float:left; margin: 50px;text-align:center; margin-top: 20px; margin-bottom:20px">
		<div>Table 3(c): Easy version</div>
		<table border="1" cellspacing="0" width="400">
		<tr>
		    <td class="slash">
		        <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
		        <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
		    </td>
		    <td>A</td>
		    <td>B</td>
		    <td>C</td>
		</tr>
		<tr>
			    <td>A</td>
			    <th>6</th>
			    <td>-5</td>
			    <td>-5</td>
			</tr>
			<tr>
			    <td>B</td>
			    <td>-5</td>
			    <td>0</td>
			    <td>0</td>
			</tr>
			<tr>
			    <td>C</td>
			    <td>-5</td>
			    <td>0</td>
			    <td>0</td>
			</tr>
	    </table>
	</div>
	</div>
</body>

<head>
	<meta charset="UTF-8" />
	<title>Image</title>
	<style type="text/css">
		.clearfix:after {
			content: '.';
			height: 0;
			display: block;
			clear: both;
		}
    </style>
</head>
<div class="clearfix" style="margin-left: 140px;">
<div style="float:left; margin: 0px;text-align:center;margin-top: 0px;"><a name="original_version"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/original_version.png" height="240px" /> </a> <div>(a) original version</div></div>
<div style="float:left; margin: 0px;text-align:center;margin-top: 0px;"><a name="hard_version"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/hard_version.png" height="240px" /> </a> <div>(b) hard version</div></div>
<div style="float:left; margin: 0px;text-align:center;margin-top: 0px; margin-bottom: 20px"><a name="easy_version"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/easy_version.png" height="240px" /> </a> <div>(c) easy version</div></div>
</div>

<center>Figure 15: Illustrations of the different levels of the Single-state Matrix Game corresponding to Table <a href="#table3">3</a></center>
<p><br /></p>

<p>Still, QMIX cannot learn the accurate pay-offs of the <strong>matrix game</strong>, as the learning results in Table <a href="#table4">4</a> show, even when we use full exploration (i.e., $\epsilon$=1 in $\epsilon$-greedy) during the whole training process. The work in [<a href="#19">19</a>] analyzes the consistency between the deterministic greedy decentralized policies and the deterministic greedy centralized policy based on the optimal joint action-value function. As illustrated in Figure <a href="#monotonic">16</a>, enforcing the consistency of the argmax operator on $Q_{tot}$ forces the learned values to be monotonic, which makes $Q_{tot}$ an inaccurate value estimate in <em>Relative Overgeneralization</em> problems.</p>
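<p>The effect can be reproduced by hand with a simpler stand-in for QMIX. The sketch below fits a purely additive factorization $Q_{tot}(a_1, a_2) = Q_1(a_1) + Q_2(a_2)$ (a VDN-style special case of a monotonic mixer, not QMIX itself) to the original pay-off matrix of Table 3(a) by least squares under uniform exploration. The helper name is hypothetical, and the closed form used (row mean plus column mean minus grand mean) is the standard least-squares solution for this balanced case.</p>

```python
def additive_fit(payoff):
    """Least-squares additive fit of a square pay-off matrix.

    Returns the fitted matrix plus each agent's utilities, taken here as
    the row means (agent 1) and column means (agent 2); only the ordering
    of these utilities matters for greedy action selection.
    """
    n = len(payoff)
    row_mean = [sum(row) / n for row in payoff]
    col_mean = [sum(payoff[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    fit = [[row_mean[i] + col_mean[j] - grand for j in range(n)]
           for i in range(n)]
    return fit, row_mean, col_mean


# Pay-offs from Table 3(a); index 0, 1, 2 stand for actions A, B, C.
ORIGINAL = [[8, -12, -12],
            [-12, 0, 0],
            [-12, 0, 0]]

fit, q1, q2 = additive_fit(ORIGINAL)
# Each agent greedily picks the action with the highest individual utility.
greedy_joint_action = (q1.index(max(q1)), q2.index(max(q2)))
```

<p>Under this fit, action A has the lowest individual utility for both agents because of the two $-12$ entries in its row and column, so the greedy joint action is not $(A, A)$. This is the same qualitative failure as in Table 4(a), although the exact learned values of QMIX differ from this additive approximation.</p>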

<head>
	<meta charset="UTF-8" />
	<title>Table</title>
	<style type="text/css">

	.text{text-align:center;}
        .slash {
            width: 100px;
            height: 50px;
            background-color: #000000;
            position: relative;
            padding: 0 !important;
        }

        .slash::before {
            content: '';
            display: block;
            width: 100%;
            height: 100%;
            background-color: #FFFFFF;
            clip-path: polygon(0px 0.5px, 0px 100%, calc(100% - 0.5px) calc(100% + 0.5px));
            position: absolute;
            top: 0;
        }

        .slash::after {
            content: '';
            display: block;
            width: 100%;
            height: 100%;
            background-color: #FFFFFF;
            clip-path: polygon(100% calc(100% - 0.5px), 100% 0px, 0px -0.5px);
            position: absolute;
            top: 0;
        }
		.clearfix:after {
			content: '.';
			height: 0;
			display: block;
			clear: both;
		}
    </style>
</head>
<body>
	<a name="table4"> </a>
	<div style="text-align: center;">Table 4: Learning results of QMIX in Single-state Matrix Game</div>
	<div class="clearfix" style="text-align: center; margin-left: 70px;">
	<div style="float:left; margin: 30px;text-align:center; margin-top: 20px; margin-bottom:20px">
		<div>Table 4(a): Original version</div>
		<table border="1" cellspacing="0" width="400">
        <tr>
            <td class="slash">
                <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
                <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
            </td>
            <td>A</td>
            <td>B</td>
            <td>C</td>
        </tr>
        <tr>
		    <td>A</td>
		    <td>-9.26</td>
		    <td>-9.44</td>
		    <td>-9.62</td>
		</tr>
		<tr>
		    <td>B</td>
		    <td>-9.1</td>
		    <th>0</th>
		    <td>-0.05</td>
		</tr>
		<tr>
		    <td>C</td>
		    <td>-9.27</td>
		    <td>-0.08</td>
		    <td>-0.6</td>
		</tr>
    </table>
	</div>
    <div style="float:left; margin: 30px;text-align:center;margin-top: 20px; margin-bottom:20px">
		<div>Table 4(b): Hard version</div>
		<table border="1" cellspacing="0" width="400">
        <tr>
            <td class="slash">
                <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
                <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
            </td>
            <td>A</td>
            <td>B</td>
            <td>C</td>
        </tr>
        <tr>
		    <td>A</td>
		    <td>-9.83</td>
		    <td>-10.01</td>
		    <td>-10.18</td>
		</tr>
		<tr>
		    <td>B</td>
		    <td>-9.65</td>
		    <td>-0.04</td>
		    <td>-0.39</td>
		</tr>
		<tr>
		    <td>C</td>
		    <td>-9.82</td>
		    <th>0.16</th>
		    <td>0.01</td>
		</tr>
    </table>
	</div>

	<div style="float:left; margin: 30px;text-align:center; margin-top: 20px; margin-bottom:20px">
		<div>Table 4(c): Easy version</div>
		<table border="1" cellspacing="0" width="400">
		<tr>
		    <td class="slash">
		        <span style="position: absolute;left: 15px;bottom: 3px;z-index: 1;">a1</span>
		        <span style="position: absolute;right: 15px;top: 3px;z-index: 1;">a2</span>
		    </td>
		    <td>A</td>
		    <td>B</td>
		    <td>C</td>
		</tr>
		<tr>
			    <td>A</td>
			    <th>6.43</th>
			    <td>-2.80</td>
			    <td>-2.66</td>
			</tr>
			<tr>
			    <td>B</td>
			    <td>-2.50</td>
			    <td>-3.05</td>
			    <td>-2.86</td>
			</tr>
			<tr>
			    <td>C</td>
			    <td>-2.66</td>
			    <td>-2.82</td>
			    <td>-2.67</td>
			</tr>
	    </table>
	</div>
	</div>
</body>

<p><a name="monotonic"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/monotonic.png" height="300px" style="margin: 0 auto;" /> </a></p>

<center>Figure 16: Monotonicity in mixing network of QMIX.<br />(Image source: QMIX <a href="https://jmlr.org/papers/volume21/20-081/20-081.pdf">[8]</a>)</center>
<p><br /></p>

<p>Two questions naturally arise: (1) <em>Why does QMIX perform better than variants such as WQMIX or QTRAN that aim to relax the monotonicity constraint of the mixing network</em>? (2) <em>How can we overcome the disadvantage of the inaccurate $Q_{tot}$ of QMIX</em>?</p>

<p>To answer these two questions, we first need to reexamine the IGM principle. The monotonicity in QMIX is defined as a constraint on the relationship between $Q_{tot}$ and each $Q_{i}$:</p>

\[Q_{tot} = \sum_{i=1}^{N}w_{i}(s_{t}) \cdot Q_{i} + b(s_{t}), \\
w_{i} = \frac{\partial Q_{tot}}{\partial Q_{i}} \geq 0, \forall i \in A.
\tag{4} \label{eq4}\]

<p>By the sufficient condition above, the weights $w_{i}$ generated by the <em>hyper-network</em> are forced to be greater than or equal to zero. In other words, the constraint shrinks the parameter space in which $w_{i}$ is searched. As illustrated in the schematic diagram in Figure <a href="#diagram">17</a> with just two agents, if the red region is the original search space, the restricted search space of $w_{i}$ is the blue region in the first quadrant. An optimal solution that lies outside this region in the original domain therefore cannot be expressed correctly. On the other hand, the area that must be searched when exhausting the whole joint state-action space also shrinks exponentially, by a factor of $(\frac{1}{2})^{N}$ ($N$ denotes the number of $w_{i}$, which equals the number of agents). Since the essence of learning in MARL is to search for the optimal joint policy parameterized by the weights and biases of the agents and the mixing network, QMIX can find a satisfying policy more quickly in the reduced parameter space.</p>
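<p>As a rough worked example of this factor, assume each $w_{i}$ is a single scalar whose sign is otherwise unconstrained, as in the two-agent diagram (a simplification, since the hyper-network actually outputs weight vectors). The fraction of the original space that satisfies $w_{i} \geq 0$ for every agent is then:</p>

\[\left(\tfrac{1}{2}\right)^{N}: \qquad N=2 \;\Rightarrow\; \tfrac{1}{4}, \qquad N=6 \;\Rightarrow\; \tfrac{1}{64}, \qquad N=27 \;\Rightarrow\; 2^{-27} \approx 7.5 \times 10^{-9}.\]

<p>Here $N=6$ and $N=27$ are the numbers of controlled agents in <em>6h_vs_8z</em> and <em>27m_vs_30m</em>, respectively, so the reduction becomes dramatic as the team grows.</p>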

<p><a name="diagram"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/diagram.png" height="400px" style="margin: 0 auto;" /> </a></p>

<center>Figure 17: Diagram of parameter searching space of two agents in QMIX</center>
<p><br /></p>

<p>As a side effect, because of the monotonicity of the mixing network, the global optimum may not lie in the parameter space that QMIX searches at all. One effective remedy is to estimate $Q_{tot}$ as accurately as possible in the hope of finding the global optimum, which probably explains why Q$(\lambda)$ in the previous section brings such a performance improvement in SMAC. Alternatively, we could carefully design the reward function to be approximately monotonic when using QMIX to solve cooperative multi-agent tasks. However, tailoring the task to the algorithm in this way is not a good idea in general; we still need to figure out how to use QMIX more effectively or develop other, more efficient algorithms.</p>

<h1 id="reproducibility-and-fairness"><a name="Reproducibility_and_Fairness">Reproducibility and Fairness</a></h1>
<p>Since experimental techniques have such a great impact on the performance of QMIX, we need to be very careful when treating QMIX as a baseline against which newly proposed algorithms (especially composite ones) are compared. Some cooperative tasks have only a few simple metrics for evaluating an algorithm (such as the <em>win rates</em> of the different scenarios in SMAC), so it remains unconvincing whether a performance improvement comes from an elaborate design specific to cooperative tasks or just from fine-tuned techniques and hyperparameters. To ensure continued progress in MARL, we are eager for the community to start a discussion about fair comparisons among algorithms and to propose a rigorous set of criteria for judging the contribution of new algorithms. We believe that community members should also consider the best ways to demonstrate that MARL continues to matter within RL.</p>

<h1 id="appendix"><a name="Appendix">Appendix</a></h1>

<h2 id="a-pseudo-code-of-ac-mix-">A Pseudo-code of AC-MIX<a id="A"> </a></h2>

<p>In this section, we show the pseudo-code for the training procedure of AC-MIX. (1) Training the critic network with offline samples and a 1-step TD error loss improves the sample efficiency of the critic; (2) we find that policy networks are sensitive to the reuse of old samples. Training the policy networks end-to-end, and the critic with TD($\lambda$), on online samples improves the learning stability of AC-MIX.</p>

<p><a name="algorithm_riit"> <img src="https://iclr.iro.umontreal.ca/edc1bf88-ae20-488e-b07a-0154e75d47ab_1642209162/public/images/2021-12-01-Implementations_that_Matter_in_Cooperative_Multi-Agent_Reinforcement_Learning/algorithm_riit.png" height="800px" style="margin: 0 auto; margin-bottom:20px;" /> </a></p>

<h2 id="b-hyperparameters">B Hyperparameters</h2>

<p>In this section, we present our hyperparameter tuning process. We obtain the optimal hyperparameters for each algorithm by grid search over the values shown in Table <a href="#t5">5</a>. Specifically,</p>

<blockquote>
  <ol>
    <li>For experiments in Table <a href="#table1">1</a>, we perform a hyperparameter search on each scenario for QMIX to demonstrate the best performance of QMIX.</li>
    <li>For experiments in Table <a href="#table2">2</a>, we perform grid search schemes on a typical hard environment (5m_vs_6m) and super hard environment (3s5z_vs_3s6z) to find a general set of hyperparameters for each algorithm. In this way, we can evaluate the robustness of these MARL algorithms.</li>
  </ol>
</blockquote>

<center>
   Table 5: Hyperparameters Search on SMAC.
</center>
<table style="text-align: center; width: 900px; margin: 0 auto; margin-bottom:20px; margin-top:20px;"><a name="t5"> </a>
  <thead>
    <tr>
      <th>Tricks</th>
      <th>Value-based (VB)</th>
      <th>Policy-based (PG)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Optimizer</td>
      <td>Adam,RMSProp</td>
      <td>Adam,RMSProp</td>
    </tr>
    <tr>
      <td>Learning Rates</td>
      <td>0.0005, 0.001</td>
      <td>0.0005, 0.001</td>
    </tr>
    <tr>
      <td>Batch Size (episodes)</td>
      <td>32, 64, 128</td>
      <td>32, 64 </td>
    </tr>
    <tr>
      <td>Replay Buffer Size</td>
      <td>5000, 10000, 20000</td>
      <td>2000, 5000, 10000</td>
    </tr>
    <tr>
      <td>Q(λ)/TD(λ)</td>
      <td>0, 0.3, 0.6, 0.9</td>
      <td>0.3, 0.6, 0.8</td>
    </tr>
    <tr>
      <td>Entropy/Adaptive Entropy</td>
      <td>-</td>
      <td>0.005, 0.01, 0.03, 0.06</td>
    </tr>
    <tr>
      <td>ε Anneal Steps</td>
      <td>50K, 100K, 500K, 1000K</td>
      <td>-</td>
    </tr>
  </tbody>
</table>
<p><br /></p>

<center>
Table 6: Hyperparameters Settings.
</center>
<table style="text-align: center; width: 1200px; margin: 0 auto; margin-bottom:20px; margin-top:20px"><a name="t6"> </a>
  <thead>
    <tr>
      <td>Algorithms</td>
      <td>QMIX</td>
      <td>OurQMIX</td>
      <td>Qatten</td>
      <td>OurQatten</td>
      <td>QPLEX</td>
      <td>OurQPLEX</td>
      <td>WQMIX</td>
      <td>OurWQMIX</td>
      <td>AC-MIX</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Optimizer</td>
        <td>RMSProp</td>
      <td>Adam</td>
      <td>RMSProp</td>
      <td>Adam</td>
      <td>RMSProp</td>
        <td>Adam</td>
        <td>RMSProp</td>
        <td>Adam</td>
        <td>Adam</td>
    </tr>
    <tr>
        <td>Batch Size (eps)</td>
        <td>32</td>
        <td>128</td>
        <td>32</td>
      <td>128</td>
      <td>32</td>
        <td>128</td>
        <td>32</td>
        <td>128</td>
         <td>32(on)/64(off)</td>
    </tr>
    <tr>
        <td>Q(λ)/TD(λ)</td>
        <td>0</td>
        <td>0.6</td>
        <td>0</td>
      <td>0.6</td>
      <td>0</td>
        <td>0.6</td>
            <td>0</td>
        <td>0.6</td>
        <td>0.6</td>
    </tr>
    <tr>
    <td>Attention Heads</td>
        <td>-</td>
        <td>-</td>
        <td>4</td>
        <td>4</td>
        <td>10</td>
        <td>4</td>
        <td>-</td>
        <td>-</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Mixing-Net Size</td>
        <td>41K</td>
        <td>41K</td>
        <td>58K</td>
        <td>58K</td>
        <td>476K</td>
        <td>152K</td>
        <td>247K</td>
        <td>247K</td>
        <td>69K</td>
    </tr>
    <tr>
        <td>ε Anneal Steps</td>
        <td colspan="8">50K→500K for 6h_vs_8z, 100K for others</td>
        <td>-</td>
    </tr>
    <tr>
        <td>Processes Num</td>
        <td>8</td>
        <td>8</td>
        <td>1</td>
        <td>8</td>
        <td>1</td>
        <td>8</td>
        <td>1</td>
        <td>8</td>
        <td>8</td>
    </tr>
  </tbody>
</table>
<p><br /></p>

<p>Table <a href="#t6">6</a> shows our general settings for these algorithms. The mixing-net sizes are calculated on the 6h_vs_8z scenario, and the prefix <em>Our</em> denotes our fine-tuned hyperparameter settings. We describe each of these hyperparameters in detail below.</p>

<p><strong>Neural Network Size</strong>. We first ensure that the network sizes are of the same order of magnitude; e.g., we use 4 attention heads in QPLEX, which reduces its mixing-net size from 476K to 152K. The hidden size of all agent networks is 64, the same as in QMIX [<a href="#8">8</a>].</p>

<p><strong>Optimizer &amp; Learning Rate</strong>. We use Adam to optimize all networks, as it may accelerate the convergence of the algorithms; the exception is VMIX, which works better with RMSProp. All neural networks are trained with a learning rate of 0.001.</p>

<p><strong>Batch Size</strong>. We find that a large batch size helps to improve the stability of the algorithms. For all value-based algorithms, we set the batch size to 128 episodes. For the policy-based algorithms, we set the batch size to 64 for offline training and 32 for online training, because online updates require only the newest data.</p>

<p><strong>Replay Buffer Size</strong>. As discussed in previous sections, a small replay buffer size facilitates the convergence of the MARL algorithms. Therefore, for SMAC, the size of all replay buffers is set to 5000 episodes. For Predator-Prey, we set the buffer size to 1000 episodes.</p>

<p><strong>Exploration</strong>. As discussed in previous sections, for value-based algorithms we use $\epsilon$-greedy action selection, annealing $\epsilon$ from 1 to 0.05 over n time steps (n is given in Table <a href="#t6">6</a>). For VMIX, we use the policy entropy loss and fine-tune its coefficient for each scenario.</p>
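<p>The annealing schedule described above can be sketched as follows. This is a minimal illustration that assumes a linear decay, which the text does not state explicitly; the function name <code>linear_epsilon</code> and its parameter names are hypothetical and not taken from any particular codebase.</p>

```python
def linear_epsilon(t, anneal_steps=100_000, start=1.0, finish=0.05):
    """Return the exploration rate at environment step `t`.

    Assumption: epsilon decays linearly from `start` to `finish` over
    `anneal_steps` steps and then stays at `finish`.
    """
    # Fraction of the annealing period that has elapsed, clipped to [0, 1].
    frac = min(max(t, 0) / anneal_steps, 1.0)
    return start + frac * (finish - start)
```

<p>With the default of 100K anneal steps, this gives an exploration rate of 1 at step 0 and 0.05 from step 100K onward; a longer period only changes <code>anneal_steps</code>.</p>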

<p><strong>N-step returns</strong>. We find that the best $\lambda$ values for Q($\lambda$) and TD($\lambda$) depend heavily on the algorithm and the scenario. We use $\lambda$ = 0.6 for all tasks, as it works stably in most scenarios; for the on-policy algorithm VMIX, we set $\lambda$ = 0.8.</p>
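<p>To illustrate how $\lambda$ interpolates between one-step and multi-step targets, the recursive $\lambda$-return can be computed backward over an episode. This is a sketch under stated assumptions rather than the implementation used in the experiments: the function name, the argument layout, and the use of plain Python lists are illustrative. For Q($\lambda$), <code>next_values[t]</code> would hold a greedy bootstrap value for the next state; for TD($\lambda$), a state-value estimate.</p>

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.6):
    """Compute lambda-returns for one episode.

    rewards[t]     : reward received at step t
    next_values[t] : bootstrap estimate for the state reached after step t
                     (use 0.0 when that state is terminal)

    Recursion: G_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * G_{t+1}).
    """
    T = len(rewards)
    returns = [0.0] * T
    # Past the last step there is no further return, so fall back on the bootstrap.
    g = next_values[-1]
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns
```

<p>Setting <code>lam=0</code> recovers the one-step TD target, while <code>lam=1</code> gives the full discounted return with a bootstrap only at the final step.</p>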

<p><strong>Rollout Processes Number</strong>. For SMAC and Predator-Prey-1, we use 8 parallel rollout processes to collect as many samples as possible from the environments at a high rate.
For Predator-Prey-2, we use 4 rollout processes. All the algorithms use the same number of processes to ensure the same number of policy iterations.</p>

<p><strong>Other Settings</strong>. We set the discount factor $\gamma$ = 0.99 for all algorithms and update the target network every 200 episodes. We find that the optimal hyperparameters of the value-based algorithms are similar, because they share the same basic architecture and training paradigm. Therefore, the settings for VDNs are the same as for QMIX.</p>
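<p>For convenience, the settings stated in this section for the fine-tuned QMIX (OurQMIX) on SMAC can be collected into a single configuration. The key names below are hypothetical and do not follow any particular framework's config schema; the values are the ones given in the text and in Table 6.</p>

```python
# Hypothetical key names; values are taken from the text and Table 6 (OurQMIX, SMAC).
our_qmix_config = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size_episodes": 128,
    "replay_buffer_episodes": 5000,       # 1000 for Predator-Prey
    "td_lambda": 0.6,                     # lambda for Q(lambda)
    "epsilon_start": 1.0,
    "epsilon_finish": 0.05,
    "epsilon_anneal_steps": 100_000,      # Table 6 lists a different value for 6h_vs_8z
    "gamma": 0.99,
    "target_update_interval_episodes": 200,
    "rollout_processes": 8,               # 4 for Predator-Prey-2
    "agent_hidden_size": 64,
}
```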

<h1 id="reference"><a name="Reference">References</a></h1>

<p><a name="1" href="https://arxiv.org/abs/1412.6980">[1] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, 2015. </a></p>

<p><a name="2" href="https://arxiv.org/abs/2103.00107">[2] Tadashi Kozuno, Yunhao Tang, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting Peng’s Q(λ) for modern reinforcement learning. arXiv preprint arXiv:2103.00107, 2021.</a></p>

<p><a name="3" href="https://arxiv.org/abs/1910.07483">[3] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. MAVEN: multi-agent variational exploration. In NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 7611–7622, 2019.</a></p>

<p><a name="4" href="https://arxiv.org/abs/1312.5602">[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.</a></p>

<p><a name="5" href="http://proceedings.mlr.press/v48/mniha16.html">[5] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1928–1937, 2016.</a></p>

<p><a name="6" href="https://www.comp.nus.edu.sg/~leews/publications/rss09.pdf">[6] Sylvie CW Ong, Shao Wei Png, David Hsu, and Wee Sun Lee. POMDPs for robotic tasks with mixed observability. 5:4, 2009.</a></p>

<p><a name="7" href="https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&amp;context=cs_faculty_pubs">[7] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In ICML 2000, Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 759–766. Morgan Kaufmann, 2000.</a></p>

<p><a name="8" href="http://proceedings.mlr.press/v80/rashid18a.html">[8] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorization for deep multi-agent reinforcement learning. In ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4292–4301, 2018.</a></p>

<p><a name="9" href="https://ui.adsabs.harvard.edu/abs/2020arXiv200610800R/abstract">[9] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: Expanding Monotonic Value Function Factorisation. arXiv preprint arXiv:2006.10800, 2020.</a></p>

<p><a name="10" href="https://arxiv.org/abs/1902.04043">[10] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.</a></p>

<p><a name="11" href="http://proceedings.mlr.press/v97/son19a.html">[11] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Hostallero, and Yung Yi. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 5887–5896, 2019. </a></p>

<p><a name="12" href="https://www.aaai.org/AAAI21Papers/AAAI-2412.SuJ.pdf">[12] Jianyu Su, Stephen Adams, and Peter A. Beling. Value-Decomposition Multi-Agent Actor-Critics. arXiv:2007.12306, 2020.  </a></p>

<p><a name="13" href="https://arxiv.org/abs/1706.05296">[13] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv preprint arXiv:1706.05296, 2017.</a></p>

<p><a name="14" href="https://go.gale.com/ps/i.do?id=GALE%7CA61573878&amp;sid=googleScholar&amp;v=2.1&amp;it=r&amp;linkaccess=abs&amp;issn=07384602&amp;p=AONE&amp;sw=w">[14] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. </a></p>

<p><a name="15" href="https://arxiv.org/abs/2008.01062">[15] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex Dueling Multi-Agent Q-Learning. arXiv:2008.01062, 2020.  </a></p>

<p><a name="16" href="https://www.aaai.org/ocs/index.php/SSS/SSS18/paper/viewPaper/17508">[16] Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent Soft Q-Learning. arXiv preprint arXiv:1804.09817, 2018.</a></p>

<p><a name="17" href="https://arxiv.org/abs/2002.03939">[17] Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv preprint arXiv:2002.03939, 2020.</a></p>

<p><a name="18" href="https://arxiv.org/abs/2010.09776">[18] Ming Zhou, Jun Luo, and Julian Villella et al. Smarts: Scalable multi-agent reinforcement learning training school for autonomous driving, 2020. </a></p>

<p><a name="19" href="https://www.jmlr.org/papers/volume21/20-081/20-081.pdf">[19] Rashid T, Samvelyan M, Schroeder de Witt C, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 2020, 21.</a></p>

<p><a name="20" href="https://ojs.aaai.org/index.php/AAAI/article/view/6223">[20] Wen C, Yao X, Wang Y, et al. Smix (λ): Enhancing centralized value functions for cooperative multi-agent reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(05): 7301-7308.  </a></p>

<p><a name="21" href="http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf">[21] Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 2012, 14(8): 2.</a></p>

<p><a name="22" href="http://proceedings.mlr.press/v119/fedus20a.html">[22] Fedus W, Ramachandran P, Agarwal R, et al. Revisiting fundamentals of experience replay. International Conference on Machine Learning. PMLR, 2020: 3061-3071.  </a></p>

<p><a name="23" href="http://proceedings.mlr.press/v119/ota20a.html">[23] Ota K, Oiki T, Jha D, et al. Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?. International Conference on Machine Learning. PMLR, 2020: 7424-7433.</a></p>

<p><a name="24" href="https://arxiv.org/abs/1312.6184">[24] Ba L J, Caruana R. Do deep nets really need to be deep?. arXiv preprint arXiv:1312.6184, 2013.</a></p>

<p><a name="25" href="https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html">[25] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.</a></p>

<p><a name="26" href="https://openaccess.thecvf.com/content_cvpr_2017/html/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.html">[26] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.</a></p>

<p><a name="27" href="http://proceedings.mlr.press/v48/wangf16.html">[27] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning. International conference on machine learning. PMLR, 2016: 1995-2003.</a></p>


</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(" ", "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>






      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
