Keywords: Multi-Agent System, Reinforcement Learning, Sparse Reward, Policy Consistency, Individual Reward
TL;DR: We propose a novel multi-agent policy optimization approach that ensures consistency between the learned and optimal team policies in environments with sparse team rewards and individual rewards.
Abstract: The sparsity of team rewards poses a significant challenge that hinders the effective learning of optimal team policies in cooperative multi-agent reinforcement learning. One common approach to mitigate this issue involves augmenting sparse team rewards with individual rewards to guide policy training. However, a significant drawback of such approaches is that modifying the reward function can potentially alter the optimal policy. To tackle this challenge, we propose a novel multi-agent policy optimization approach that ensures consistency between the mixed policy (learned from a combination of individual and team rewards) and the team policy (based solely on team rewards), through a new policy consistency constraint that aligns the returns of the two policies in the policy optimization model. We further develop an iterated policy optimization procedure to solve the formulated problem, deriving an approximate optimization objective for each iteration of the mixed and team policies. Experimental evaluation conducted in the StarCraft II Multi-Agent Challenge (SMAC), Multi-Agent Particle Environment (MPE), and Google Research Football (GRF) environments demonstrates that our proposed approach effectively addresses the policy inconsistency problem and consistently outperforms strong baseline methods.
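For concreteness, a minimal sketch of the constrained objective described in the abstract, in our own notation (the paper's exact formulation may differ): with $r_t^{\text{team}}$ and $r_t^{\text{ind}}$ denoting the team and individual rewards, $\gamma$ a discount factor, and returns $J_{\text{team}}(\pi)=\mathbb{E}_{\pi}\big[\sum_t \gamma^t r_t^{\text{team}}\big]$ and $J_{\text{mix}}(\pi)=\mathbb{E}_{\pi}\big[\sum_t \gamma^t (r_t^{\text{team}}+r_t^{\text{ind}})\big]$, the mixed policy $\pi_{\text{mix}}$ is optimized subject to matching the team policy $\pi_{\text{team}}$ on the team-reward return:
$$\max_{\pi_{\text{mix}}}\; J_{\text{mix}}(\pi_{\text{mix}}) \quad \text{s.t.} \quad J_{\text{team}}(\pi_{\text{mix}}) = J_{\text{team}}(\pi_{\text{team}}).$$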
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3387