Multi-objective Multi-agent Reinforcement Learning with Pareto-stationary Convergence

25 Sept 2024 (modified: 28 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multi-objective, multi-agent reinforcement learning, Pareto-stationary convergence
Abstract: Multi-objective multi-agent reinforcement learning (MOMARL) problems frequently arise in real-world applications (e.g., path planning for swarm robots), yet they have not been well explored. Finding a Pareto optimum is NP-hard, so several multi-objective algorithms have recently emerged that provide Pareto-stationary solutions in a centralized manner, managed by a single agent. However, they cannot handle the MOMARL problem, as the dimension of the global state-action pair $(\boldsymbol{s},\boldsymbol{a})$ grows exponentially with the number of spatially distributed agents. To tackle this issue, we design a novel graph-truncated $Q$-function approximation method for each agent $i$, which requires not the global state-action pair $(\boldsymbol{s},\boldsymbol{a})$ but only the neighborhood state-action pair $(s_{\mathcal{N}^{\kappa}_{i}},a_{\mathcal{N}^{\kappa}_{i}})$ of its $\kappa$-hop neighbors. To further reduce the dimension to the state-action pair $(s_{\mathcal{N}^{\kappa}_{i}},a_{i})$ with only the local action, we develop the concept of an action-averaged $Q$-function and establish the equivalence between using the graph-truncated $Q$-function and the action-averaged $Q$-function for policy gradient approximation. Accordingly, we develop a distributed, scalable algorithm with linear function approximation and prove that it converges to a Pareto-stationary solution at rate $\mathcal{O}(1/T)$, inversely proportional to the time horizon $T$. Finally, we run simulations in a robot path-planning environment and show that our algorithm converges to larger multi-objective values than the latest MORL algorithm and performs close to the centralized optimum with much shorter running time.
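The submission's code is not shown on this page, so the following is only a minimal, hypothetical Python sketch of the graph-truncation idea described in the abstract: an agent's $Q$-value is approximated linearly from features of its $\kappa$-hop neighborhood state-action pair alone, never the global $(\boldsymbol{s},\boldsymbol{a})$. The adjacency structure, feature map `feature_fn`, and weight vector `weights` are illustrative assumptions, not the authors' algorithm (which additionally involves the action-averaged $Q$-function and the full distributed policy-gradient updates).

```python
import numpy as np

# Hypothetical sketch (not the paper's code): graph-truncated Q-value for one agent,
# using only the states/actions of its kappa-hop neighbors and a linear feature model.
# `adjacency`, `feature_fn`, and `weights` below are illustrative placeholders.

def kappa_hop_neighbors(adjacency, agent, kappa):
    """Breadth-first search: agent i plus every agent within kappa hops."""
    frontier, visited = {agent}, {agent}
    for _ in range(kappa):
        frontier = {j for u in frontier for j in adjacency[u]} - visited
        visited |= frontier
    return sorted(visited)

def truncated_q(weights, feature_fn, states, actions, adjacency, agent, kappa):
    """Linear estimate of Q_i(s_{N^kappa_i}, a_{N^kappa_i}) ~= w_i^T phi(.).

    The global (s, a) is never formed; only the neighborhood slice enters the
    features, so the parameter dimension does not grow with the agent count.
    """
    nbrs = kappa_hop_neighbors(adjacency, agent, kappa)
    phi = feature_fn({j: states[j] for j in nbrs}, {j: actions[j] for j in nbrs})
    return float(weights @ phi)

# Toy usage on a 4-agent line graph 0-1-2-3 with kappa = 1.
if __name__ == "__main__":
    adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    states = np.array([0.2, -0.1, 0.5, 0.3])
    actions = np.array([1, 0, 1, 1])
    feature_fn = lambda s, a: np.array([sum(s.values()), sum(a.values()), 1.0])
    weights = np.array([0.4, -0.3, 0.1])
    print(truncated_q(weights, feature_fn, states, actions, adjacency, 1, kappa=1))
```

The only point of the sketch is that the feature dimension depends on the size of the $\kappa$-hop neighborhood rather than on the total number of agents, which is the scalability property the abstract claims.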
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4037