Decentralized Policy Gradients for Optimizing Generalizable Policies in Multi-Agent Reinforcement Learning

TMLR Paper 5607 Authors

12 Aug 2025 (modified: 15 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Parameter Sharing (PS) is a widely used practice in Multi-Agent Reinforcement Learning (MARL), where a single neural network is shared among all agents. Despite its efficiency and effectiveness, PS can occasionally result in suboptimal performance. While prior research has primarily addressed this issue from the perspective of update conflicts among different agents, we investigate it from an optimization standpoint. Specifically, we point out the analogy between PS in MARL and Centralized SGD (CSGD) in distributed learning and hypothesize that PS may inherit convergence and generalization issues similar to those of CSGD, such as lower convergence levels of key metrics and larger generalization gaps. To address these issues, we propose Decentralized Policy Gradients (DecPG), which leverages the principles of Decentralized SGD. We use an environment with additional noise injected into the observation and action spaces to evaluate the generalization of DecPG. Empirical results show that DecPG outperforms its centralized counterpart, PS, across various aspects: achieving higher rewards, smaller generalization gaps, and flatter reward landscapes. The results confirm that PS suffers from convergence and generalization issues similar to those of CSGD, and show that our DSGD-based method, DecPG, effectively mitigates these problems, offering a new optimization perspective on MARL algorithm performance.
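To make the Decentralized-SGD analogy concrete, the sketch below illustrates the general DSGD pattern the abstract alludes to: each agent keeps its own parameter copy, takes a local policy-gradient step, and then gossip-averages with neighbours via a doubly stochastic mixing matrix. The ring topology, mixing weights, learning rate, and the placeholder gradient estimator are all assumptions for illustration; they are not the paper's exact DecPG procedure.

```python
import numpy as np

n_agents, dim = 4, 8
rng = np.random.default_rng(0)
params = [rng.normal(size=dim) for _ in range(n_agents)]  # one parameter copy per agent

# Doubly stochastic mixing matrix for an assumed ring topology:
# each agent averages with itself and its two neighbours.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

def policy_gradient(theta):
    """Hypothetical stand-in for an agent's local policy-gradient estimate
    (here, the ascent direction of a toy quadratic reward plus noise)."""
    return -theta + rng.normal(scale=0.1, size=theta.shape)

lr = 0.05
for step in range(100):
    # 1) Local policy-gradient (ascent) step on each agent's own parameters.
    grads = [policy_gradient(theta) for theta in params]
    params = [theta + lr * g for theta, g in zip(params, grads)]
    # 2) Gossip/averaging step: mix parameters with neighbours via W.
    stacked = np.stack(params)   # shape (n_agents, dim)
    params = list(W @ stacked)   # each row is a convex combination of neighbours
```

Under this pattern, agents' parameters stay close to one another (through mixing) without being identical, in contrast to full Parameter Sharing, which corresponds to every agent updating one shared copy as in Centralized SGD.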
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Kamil_Ciosek1
Submission Number: 5607