Abstract: Cooperative Multi-Agent Reinforcement Learning (MARL) has become a critical tool for addressing complex real-world problems.
However, off-policy MARL methods, which rely on joint Q-functions, face significant scalability challenges due to the exponentially growing joint action space.
In this work, we highlight a critical yet often overlooked issue: erroneous Q-target estimation, primarily caused by extrapolation error.
Our analysis reveals that this error becomes increasingly severe as the number of agents grows, leading to unique challenges in MARL due to its expansive joint action space and the decentralized execution paradigm.
To address these challenges, we propose a suite of techniques tailored for off-policy MARL, including annealed multi-step bootstrapping, averaged Q-targets, and restricted action representation. Experimental results demonstrate that these methods effectively mitigate erroneous estimations, yielding substantial performance improvements in challenging benchmarks such as SMAC, SMACv2, and Google Research Football.
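To make the target-construction idea concrete, below is a minimal sketch of how an annealed multi-step bootstrapped target combined with averaged Q-targets might be computed. The function name, the `progress` annealing schedule, and the `q_next_ensemble` structure are illustrative assumptions for this sketch, not the paper's actual implementation, and restricted action representation is omitted here.

```python
import numpy as np

def annealed_nstep_target(rewards, q_next_ensemble, gamma=0.99, n_max=5, progress=0.0):
    """Sketch: n-step bootstrapped Q-target with an annealed horizon and an
    averaged bootstrap value.

    rewards:          list of the next rewards along the sampled trajectory
    q_next_ensemble:  q_next_ensemble[k] holds several target-network Q-estimates
                      for the state reached after k+1 steps (assumed layout)
    progress:         training progress in [0, 1], used to anneal the horizon
    """
    # Anneal the bootstrap horizon: rely on longer reward rollouts early,
    # when the Q-estimate is least trustworthy, and shorten it over training.
    n = max(1, int(round(n_max * (1.0 - progress))))
    n = min(n, len(rewards))

    # Averaged Q-target: average several target estimates to damp the effect
    # of any single erroneous (extrapolated) value.
    bootstrap = float(np.mean(q_next_ensemble[n - 1]))

    # Discounted n-step return plus the discounted averaged bootstrap value.
    return sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * bootstrap
```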
Lay Summary: In cooperative Multi-Agent Reinforcement Learning (MARL), multiple agents learn to work together to solve complex tasks. Off-policy methods—which aim to train agents more efficiently—are popular in this area. However, they often struggle when the number of agents grows. This is because these methods must consider all possible combinations of team actions, which quickly becomes overwhelming and leads to inaccurate predictions during training.
Our research shows that these prediction errors, though often overlooked, play a major role in the poor performance of off-policy MARL at larger scales. Existing solutions only partially address the issue, leaving room for improvement.
To tackle this, we introduce three simple and broadly applicable techniques that make the learning process more reliable. These strategies help the system focus on more useful training signals and reduce the impact of inaccurate predictions.
We test our methods on several challenging multi-agent tasks and observe consistent performance improvements. Our findings not only enhance current off-policy MARL approaches but also offer a new perspective on how to train large teams of AI agents more effectively.
Primary Area: Reinforcement Learning->Multi-agent
Keywords: cooperative multi-agent reinforcement learning, off-policy multi-agent reinforcement learning, value factorization, extrapolation error
Submission Number: 15095