$f$-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

Published: 31 Mar 2025, Last Modified: 31 Mar 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Independent learning is a straightforward approach to fully decentralized learning in cooperative multi-agent reinforcement learning (MARL). Independent learning has been studied for decades, and representative methods such as independent Q-learning and independent PPO achieve strong performance on several benchmarks. However, most independent learning algorithms lack convergence guarantees or theoretical support. In this paper, we propose a general formulation of independent policy optimization, $f$-divergence policy optimization. We hope that a more general policy optimization formulation will provide deeper insights into fully decentralized learning. We demonstrate the generality of this formulation and analyze its limitations. Based on this formulation, we further propose a novel independent learning algorithm, TVPO, with a theoretical convergence guarantee. Empirically, we show that TVPO outperforms state-of-the-art fully decentralized learning methods on three popular cooperative MARL benchmarks, verifying its efficacy.
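The abstract does not state the formulation itself, but for background, an $f$-divergence between discrete distributions $p$ and $q$ is $D_f(p \,\|\, q) = \sum_x q(x)\, f\!\big(p(x)/q(x)\big)$ for a convex $f$ with $f(1)=0$; KL divergence and total variation are special cases. The sketch below computes both for two categorical policies. The function names, the example policies, and the choice of generators are illustrative assumptions for this page, not definitions taken from the paper.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x)/q(x)) for discrete p, q (q > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Generator for KL divergence: f(t) = t * log(t)  ->  D_f = KL(p || q)
kl_gen = lambda t: t * np.log(t)
# Generator for total variation: f(t) = |t - 1| / 2  ->  D_f = TV(p, q)
tv_gen = lambda t: np.abs(t - 1.0) / 2.0

# Hypothetical updated and previous policies over 3 actions (illustrative only)
pi_new = np.array([0.7, 0.2, 0.1])
pi_old = np.array([0.5, 0.3, 0.2])

d_kl = f_divergence(pi_new, pi_old, kl_gen)
d_tv = f_divergence(pi_new, pi_old, tv_gen)
# Total variation also equals half the L1 distance between the distributions
assert np.isclose(d_tv, 0.5 * np.abs(pi_new - pi_old).sum())
```

A divergence-to-the-previous-policy term of this kind is the usual way trust-region-style policy optimization constrains updates; the paper's specific use of it in the decentralized setting is given in the main text.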
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Baoxiang_Wang1
Submission Number: 3906