A Causality-Inspired Spatial-Temporal Return Decomposition Approach for Multi-Agent Reinforcement Learning

Published: 30 Oct 2024, Last Modified: 07 Nov 2024 · CRL@NeurIPS 2024 Poster · License: CC BY 4.0
Keywords: Causal Reinforcement Learning, Multi-agent Reinforcement Learning, Credit Assignment
Abstract: Multi-agent reinforcement learning (MARL) has been widely developed to solve cooperative multi-agent problems, yet it remains limited in its ability to explain decision-making processes. This challenge becomes particularly pronounced under delayed rewards, especially episodic ones, where credit must be assigned along both the temporal axis and the spatial axis spanning multiple agents. In this paper, we propose a CAusally-inspired Spatial-Temporal return decomposition method, named CAST, to tackle episodic rewards in cooperative MARL. CAST provides interpretable return decomposition and accommodates the complexity of multi-agent dynamics by relaxing a common linearity assumption. Specifically, along the temporal dimension, the episodic long-term return is decomposed as a linear sum of team rewards over all time steps. More interestingly, along the spatial dimension, rather than a simple linear sum of individual rewards, team rewards are allowed to be general nonlinear mixtures of individual rewards, enabling more reasonable and precise credit allocation. We show theoretically that, under the proposed framework, the team rewards, individual rewards, and underlying causal relationships are identifiable, which naturally introduces additional structural constraints that enhance the interpretability of reward redistribution. Our experiments demonstrate state-of-the-art results on the Multi-agent Particle Environment (MPE) and its variants, and visualizations of the learned causal structure illustrate the interpretability of our method.
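A minimal sketch of the two-level decomposition the abstract describes; the symbols $R$, $r_t$, $r_t^i$, and $f$ are illustrative placeholders rather than the paper's own notation:

```latex
% Hedged sketch of the spatial-temporal return decomposition.
% Symbols (R, r_t, r_t^i, f, T, N) are illustrative assumptions, not the paper's notation.
\[
  R \;=\; \sum_{t=1}^{T} r_t
  \quad \text{(temporal: the episodic return is a linear sum of per-step team rewards)}
\]
\[
  r_t \;=\; f\!\left(r_t^{1}, r_t^{2}, \dots, r_t^{N}\right)
  \quad \text{(spatial: the team reward is a general nonlinear mixture of the $N$ agents' individual rewards)}
\]
```

The key relaxation is in the spatial step: $f$ need not be a linear sum, which is what allows the more flexible credit allocation claimed above.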
Submission Number: 28