Policy consensus-based distributed deterministic multi-agent reinforcement learning over directed graphs
Abstract: Learning efficient coordination policies over continuous state and action spaces remains a major challenge for existing distributed multi-agent reinforcement learning (MARL) algorithms. In this article, the classic deterministic policy gradient method is extended to the distributed MARL domain to handle continuous control policy learning for a team of homogeneous agents connected through a directed graph. A theoretical on-policy distributed actor-critic algorithm is first proposed based on a local deterministic policy gradient theorem, which considers observation-based policies and incorporates consensus updates for the critic and actor parameters. Stochastic approximation theory is then used to establish asymptotic convergence of the algorithm under standard assumptions. Thereafter, a practical distributed deterministic actor-critic algorithm is proposed by integrating the theoretical algorithm with a deep reinforcement learning training architecture, improving scalability, exploration ability, and data efficiency. Simulations are carried out in standard MARL environments with continuous action spaces; the results demonstrate that the proposed distributed algorithm achieves learning performance comparable to strong centrally trained baselines while requiring far fewer communication resources.
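The abstract describes each agent mixing its critic and actor parameters with those of its in-neighbours over the directed communication graph before applying a local gradient correction. The sketch below illustrates that general consensus-plus-local-update pattern only; the function names, the row-stochastic weight matrix W, and the plain gradient step are assumptions for illustration and are not the paper's exact algorithm.

```python
import numpy as np

# W is an assumed row-stochastic weight matrix induced by the directed graph:
# W[i, j] > 0 only if agent i receives parameters from agent j.
# theta is an (N, d) array holding each agent's local parameter vector
# (e.g., critic weights); grads holds each agent's local gradient estimate.

def consensus_step(theta, W):
    """Mix each agent's parameters with those of its in-neighbours."""
    return W @ theta

def local_gradient_step(theta_i, grad_i, lr=1e-3):
    """Apply a local gradient correction after the consensus mixing."""
    return theta_i - lr * grad_i

# Toy usage with 3 agents and 4-dimensional parameters (placeholder values).
rng = np.random.default_rng(0)
N, d = 3, 4
W = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5]])          # each row sums to 1
theta = rng.normal(size=(N, d))
grads = rng.normal(size=(N, d))          # stand-in for local critic/actor gradients

theta = consensus_step(theta, W)
theta = np.stack([local_gradient_step(theta[i], grads[i]) for i in range(N)])
```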