Local Advantage Networks for Multi-Agent Reinforcement Learning in Dec-POMDPs

Raphaël Avalos; Mathieu Reymond; Ann Nowe; Diederik M Roijers

Published: 25 Oct 2023, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0.

Abstract: Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn, for each agent, a decentralized best-response policy via an individual advantage function. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
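As a rough illustration of the dueling idea described in the abstract, the Python sketch below combines a centralized value estimate with one agent's local advantages. It is a minimal sketch under assumptions not stated on this page: the advantages are shifted so that the best action has advantage zero (a common dueling convention, which the paper may or may not use), and the function names and numbers are hypothetical rather than the authors' implementation.

```python
from typing import List


def local_q_values(central_value: float, advantages: List[float]) -> List[float]:
    """Combine a centralized value estimate with one agent's local advantages.

    Assumption: advantages are shifted so the best action has advantage 0,
    a usual dueling normalization; the paper may use a different form.
    """
    best = max(advantages)
    return [central_value + (a - best) for a in advantages]


def greedy_action(advantages: List[float]) -> int:
    """Decentralized execution: the argmax needs only the local advantages."""
    return max(range(len(advantages)), key=lambda i: advantages[i])


# Hypothetical numbers for a single agent with three actions.
advantages = [0.5, 2.0, -1.0]
q_values = local_q_values(central_value=10.0, advantages=advantages)
action = greedy_action(advantages)
```

Because the critic term is the same for every action, dropping it does not change the greedy action, which is consistent with the abstract's statement that the critic is cast aside after learning.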

Submission Length: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=hF7n7ncYw1

Changes Since Last Submission: Below we list the points raised by the reviewers and how we addressed them.
- About the four claims of LAN: The original submission contained four claims explaining LAN's performance: better update targets (BT), mitigating the moving target problem (MT), mitigating multi-agent credit assignment (MACA), and reducing learning complexity (LC). Based on reviewer feedback, we separated these claims into two properties (MT and MACA) and two intuitions (BT and LC). The reviewers were satisfied with our MT claim and with the fact that we followed their recommendation to add an experiment for MACA. The difficulty in proving BT and LC lies in deriving ablations for LAN: swapping the centralized value function yields IQL (used as a baseline), whereas swapping the local advantage would prevent decentralized execution. Training a centralized agent beforehand to obtain a fixed centralized value function is also not feasible, because the joint action space grows exponentially with the number of agents.
- Statistical significance and standard deviation of Figure 3: We increased the number of runs from 6 to 10. In SMAC, the metric used is the median win rate together with the first and third quartiles. As Figure 3 shows the median win rate averaged over all 14 maps, we added the averages of the first and third quartiles to Figure 3 in place of the standard deviation.
- Initial claim of LAN reaching SOTA on all maps: We revised this statement in the introduction. Although LAN outperforms QPLEX by 10% in the aggregated SMAC score, it under-performs on one map, and the 10% increase is achieved by obtaining good results (+90% and +40%) on two super hard maps.
- About Hu et al., "Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative MARL" (2021): We acknowledge that QMIX could potentially be fine-tuned for better results. However, for this study we decided to keep the original hyper-parameters of QMIX (no parallel environments, network size, etc.) for all the algorithms and to use one set of hyper-parameters for all the maps. This is similar to the evaluation method of QPLEX and QMIX. Regarding the performance of QMIX-Adam in Hu et al., we would like to highlight two points. (a) Their experiments use a different version of SMAC that is believed to be easier, which makes their results not comparable; this is acknowledged by the authors of that paper and of QMIX. (b) They note that Adam-QMIX requires parallel environments to perform well, which is an additional departure from the original algorithm.
- P_a inaccuracy: The difference between our equation and that of Foerster et al. (2017) is that we assumed, for notational purposes, that observations were deterministic. We modified the equation.
- Target value: All the networks are learned in parallel using Equation (3) as the target. The explanation of the training is located above Equation (3). Equation (4) (formerly 5) highlights the induced target for the advantage, in order to draw a connection with the COMA paper, which focuses on the multi-agent credit assignment problem.
- Discussion around bias and variance: This discussion was reformulated in Intuition 1. We removed Appendix E (added during the rebuttal), as we realized that comparing gradient norms of different methods with different architectures would not allow us to draw any meaningful conclusion. As we could not find additional experiments or metrics to substantiate the BT claim, we marked it as an intuition (see above).
- Loss of information by summing the embeddings: We agree with the meta-reviewer that summing the embeddings results in an information loss. This is why our architecture does not directly sum the hidden states of the agents' RNNs but first computes an embedding, in the hope that the embedding representation is summable. We tried alternatives such as concatenation in earlier experiments, and the current architecture performs better. We updated the second paragraph of Architecture in the paper with this information.
- Scalability: We acknowledge that scalability is not only about the number of parameters but also about sample complexity. However, limiting the number of parameters is highly important for extending to many agents. A network size that is invariant in the number of agents, as in LAN (the difference for LAN between Tables 1 and 2 results from a change in observation size), would allow the algorithm to scale to a large number of agents efficiently. The growth of QPLEX's mixing network, on the other hand, might limit its applicability.

Additionally:
- We kept the same version of SMAC as the baselines to ensure fairness in our comparison and to avoid running unnecessary experiments, as we have limited compute resources (the meta-reviewer did not find changing the version necessary).
- We fixed the position of the related work section and added a visual comparison.
- We clarified the claim "LAN does not have any restriction on the family of decentralized functions".
- We added more technical information regarding Eq. (2) next to the equation.
- We clarified the fact that "LAN can represent all decentralized policies".
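The points about summing embeddings and about network size can be sketched with a toy example. This is only an illustration, assuming each agent's hidden state has already been mapped to a fixed-size embedding; `sum_pool`, `concatenate`, and all numbers are hypothetical and not taken from the paper's code.

```python
from typing import List


def sum_pool(embeddings: List[List[float]]) -> List[float]:
    """Element-wise sum of per-agent embeddings (all of the same length)."""
    return [sum(column) for column in zip(*embeddings)]


def concatenate(embeddings: List[List[float]]) -> List[float]:
    """Alternative: concatenation, whose length grows with the agent count."""
    return [x for emb in embeddings for x in emb]


# Hypothetical 2-dimensional embeddings for teams of three and five agents.
three_agents = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
five_agents = three_agents + [[1.0, 1.0], [2.0, 0.0]]

# Summation keeps the pooled size fixed at the embedding dimension,
# so a critic consuming it needs no extra input units as agents are added.
pooled_three = sum_pool(three_agents)
pooled_five = sum_pool(five_agents)

# Concatenation scales linearly with the number of agents.
concat_three = concatenate(three_agents)
concat_five = concatenate(five_agents)
```

The trade-off mentioned above is visible here: different sets of embeddings can produce the same sum, so some information is lost, but the pooled vector keeps the same size for any team.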

Supplementary Material: zip

Assigned Action Editor: ~Michael_Bowling1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 1295
