ReMIX: Regret Minimization for Monotonic Value Function Factorization in Multi-Agent Reinforcement Learning


TMLR Paper1010 Authors

29 Mar 2023 (modified: 17 Sept 2024) | Withdrawn by Authors | Readers: Everyone | License: CC BY 4.0

Abstract: Value function factorization methods have become a dominant approach for cooperative multi-agent reinforcement learning under the centralized training and decentralized execution paradigm. By factorizing the optimal joint action-value function using a monotonic mixing function of agents' utilities, these algorithms ensure consistency between joint and local action selections for decentralized decision-making. Nevertheless, the use of monotonic mixing functions also induces representational limitations. Finding the optimal projection of an unrestricted mixing function onto monotonic function classes is still an open problem. To this end, we propose ReMIX, which formulates this optimal projection problem for value function factorization as a regret minimization over the projection weights of different state-action values. This optimization problem can be relaxed and solved using the Lagrange multiplier method to obtain closed-form optimal projection weights. By minimizing the resulting policy regret, we can narrow the gap between the optimal and the restricted monotonic mixing functions, thus obtaining an improved monotonic value function factorization. Our experimental results on the Predator-Prey and StarCraft Multi-Agent Challenge environments demonstrate the effectiveness of our method and indicate that it is better able to handle environments with non-monotonic value functions.
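The consistency property mentioned in the abstract can be illustrated with a small sketch. It is not the ReMIX method or the paper's mixing network: it assumes a hypothetical linear mixer with strictly positive weights (the names `monotonic_mix` and `greedy_actions` are made up for this example) and checks by brute force that each agent's greedy action on its own utility matches the greedy joint action under the mixed value.

```python
import itertools

import numpy as np


def monotonic_mix(utilities, weights, bias):
    """Hypothetical linear mixer; positive weights make it monotonic in each utility."""
    return float(np.dot(weights, utilities) + bias)


def greedy_actions(agent_qs, weights, bias):
    """Compare decentralized greedy actions with the brute-force joint argmax.

    agent_qs has shape (n_agents, n_actions); weights must be strictly positive.
    """
    n_agents, n_actions = agent_qs.shape
    # Each agent maximizes its own utility independently.
    decentralized = tuple(int(a) for a in agent_qs.argmax(axis=1))

    # Enumerate every joint action and keep the one with the highest mixed value.
    best_joint, best_value = None, -np.inf
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        chosen = agent_qs[np.arange(n_agents), list(joint)]
        value = monotonic_mix(chosen, weights, bias)
        if value > best_value:
            best_joint, best_value = joint, value
    return decentralized, best_joint


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    agent_qs = rng.normal(size=(3, 4))          # 3 agents, 4 actions each
    weights = rng.uniform(0.1, 1.0, size=3)     # strictly positive mixing weights
    decentralized, joint = greedy_actions(agent_qs, weights, bias=0.5)
    print(decentralized == joint)
```

The same monotonicity that guarantees this match is what limits which joint value functions such a mixer can represent, which is the limitation the abstract refers to.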

Submission Length: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=4yH7ouS43L&referrer=%5BTMLR%5D(%2Fgroup%3Fid%3DTMLR)

Changes Since Last Submission: Compared with the previous submission, we have made the following changes:
1. We resolve the technical clarity and presentation issues of the previous version. For instance, we adopt a boldface $ \boldsymbol{\pi} $ to denote the joint policy. Furthermore, since training is centralized, using the state $ s $ alone, rather than both the state $ s $ and the history $ \tau $, is a common practice in existing works. We therefore use the state in place of the state and history to avoid ambiguity, and all related definitions and terms are changed accordingly.
2. In the background section, we add more description of value decomposition approaches for cooperative multi-agent reinforcement learning tasks. In the previous version, this material appeared in the related work section under the name "Weighting Scheme in WQMIX". The new "Value Function Decomposition" section explains the development of value decomposition methods and our motivation for finding the optimal projection weights.
3. We add a section to the related work that introduces multi-agent reinforcement learning and some standard methods in this area, providing the necessary background on this domain.
4. We also report the measurement of the on-policy transition factor in the ablation experiment; in the implementation, this is achieved with an additional, separate replay buffer that collects recently used transitions. This term indicates that focusing more on on-policy transitions can yield better results. Although the experimental results validate that capturing the previous three key terms delivers significant improvement, we use an approximate method to estimate the on-policy term.

Assigned Action Editor: ~Marc_Lanctot1

Submission Number: 1010
