Abstract: The study of fully decentralized learning, or independent learning, in cooperative multi-agent reinforcement learning has a decades-long history. Recent empirical studies have shown that independent PPO (IPPO) can achieve performance comparable to, or even better than, centralized training with decentralized execution methods in several benchmarks. However, a decentralized actor-critic algorithm with a convergence guarantee remains an open problem. In this paper, we propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with guaranteed monotonic improvement and convergence. We derive a novel decentralized surrogate for policy optimization such that monotonic improvement of the joint policy is guaranteed when each agent independently optimizes the surrogate. For practical implementation, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we evaluate DPO, IPPO, and independent Q-learning (IQL) in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces as well as fully and partially observable environments. The results show that DPO outperforms both IPPO and IQL in most tasks, supporting our theoretical results. The code is available at https://github.com/PKU-RL/DPO.
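As a rough illustration only (not the paper's actual objective, which is given in the full text), a per-agent surrogate in the spirit described above could combine a clipped policy-ratio term with a KL penalty, each weighted by an adaptive coefficient. The function and coefficient names below (`beta_clip`, `beta_kl`) are assumptions made for this sketch.

```python
# Hypothetical sketch of a per-agent surrogate with two adaptive coefficients.
# This is NOT the authors' DPO objective; it only illustrates the general idea
# of independently optimizing a coefficient-weighted surrogate at each agent.
import torch

def decentralized_surrogate(log_probs_new, log_probs_old, advantages, kl,
                            beta_clip=0.2, beta_kl=1.0):
    """Clipped policy-ratio term minus a KL penalty, averaged over samples."""
    ratio = torch.exp(log_probs_new - log_probs_old)              # pi_new / pi_old
    clipped = torch.clamp(ratio, 1.0 - beta_clip, 1.0 + beta_clip)
    policy_term = torch.min(ratio * advantages, clipped * advantages)
    return (policy_term - beta_kl * kl).mean()
```

In such a scheme, each agent would maximize its own surrogate using only local information, with the two coefficients adapted over training; the precise adaptation rule and guarantees are those derived in the paper, not this sketch.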
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/PKU-RL/DPO
Supplementary Material: zip
Assigned Action Editor: ~Steven_Stenberg_Hansen1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1455