DOP: Off-Policy Multi-Agent Decomposed Policy Gradients

Yihan Wang; Beining Han; Tonghan Wang; Heng Dong; Chongjie Zhang

DOP: Off-Policy Multi-Agent Decomposed Policy Gradients

Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang

Published: 12 Jan 2021, Last Modified: 05 May 2023ICLR 2021 PosterReaders: Everyone

Keywords: Multi-Agent Reinforcement Learning, Multi-Agent Policy Gradients

Abstract: Multi-agent policy gradient (MAPG) methods recently witness vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: We propose an off-policy multi-agent decomposed policy gradient method, addressing the drawbacks that prevent existing multi-agent policy gradient methods from achieving state-of-the-art performance.

Supplementary Material: zip

Data: [SMAC](https://paperswithcode.com/dataset/smac)

11 Replies

Loading