Keywords: communication learning; multi-agent reinforcement learning
TL;DR: We propose SOPS, a scalable multi-agent communication framework for large-scale scenarios that learns dynamic topologies over efficient backbones to improve adaptivity and efficiency.
Abstract: In multi-agent reinforcement learning (MARL), the problems of partial observability and stochasticity can be alleviated by enabling agents to access additional information about others through communication. However, in large-scale settings, the number of pairwise communication links grows quadratically with the number of agents, resulting in excessive bandwidth consumption and memory bottlenecks. Previous studies primarily aimed to solve this problem by learning globally optimal communication graphs, but such designs inevitably incur rapidly escalating complexity as the agent population increases. In this work, we propose a scalable communication scheme for large-scale MARL, termed $\textit{Sparse tOpology Pairwise Scoring}$ (SOPS). We hypothesize that leveraging pairwise relations among agents over an efficient backbone topology can enhance cooperative policies, and we adopt an exponential graph as a scalable backbone topology with a small diameter. On top of this backbone, we learn a probabilistic subgraph distribution parameterized by a pairwise scoring network that adaptively incorporates agent states and edge-type embeddings. To enable gradient-based optimization through discrete subgraph sampling, we employ Gumbel-Sigmoid reparameterization, whose differentiable nature allows the entire framework to be trained end-to-end. Overall, SOPS maintains high communication efficiency while adapting dynamically to task requirements and temporal variations. Evaluation results show that SOPS significantly outperforms existing state-of-the-art methods across cooperative benchmarks of diverse scales, consistently achieving higher rewards and faster convergence. SOPS also exhibits robust zero-shot transfer capabilities, enabling a model trained at a smaller scale to transfer effectively to larger-scale scenarios.
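The two mechanisms named in the abstract can be illustrated concretely. Below is a minimal NumPy sketch (not the authors' implementation): an exponential-graph backbone, where node $i$ links to $(i + 2^k) \bmod n$ so that out-degree and diameter are both $O(\log n)$, followed by Gumbel-Sigmoid relaxation of per-edge Bernoulli sampling. The zero logits stand in for the paper's pairwise scoring network, which would compute them from agent states and edge-type embeddings.

```python
import numpy as np

def exponential_backbone(n):
    """Directed exponential-graph edges: node i links to (i + 2^k) mod n
    for all k with 2^k < n, giving O(log n) out-degree and diameter."""
    edges = []
    for i in range(n):
        k = 0
        while (1 << k) < n:
            edges.append((i, (i + (1 << k)) % n))
            k += 1
    return edges

def gumbel_sigmoid(logits, tau=0.5, rng=None):
    """Differentiable relaxation of Bernoulli edge sampling: add
    Logistic(0, 1) noise to the logits, then apply a temperature-scaled
    sigmoid so gradients can flow through the (relaxed) subgraph sample."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log1p(-u)  # inverse-CDF sample of Logistic(0, 1)
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + noise) / tau))

n = 8
edges = exponential_backbone(n)              # sparse backbone: n * log2(n) edges
logits = np.zeros(len(edges))                # stand-in for the scoring network
soft_mask = gumbel_sigmoid(logits)           # soft edge mask, differentiable
hard_mask = (soft_mask > 0.5).astype(float)  # discrete subgraph at execution
```

With $n = 8$ agents the backbone has only $8 \times 3 = 24$ directed edges rather than the $56$ of a complete digraph, which is the source of the scheme's communication efficiency; the learned mask then prunes this backbone adaptively per step.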
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 11161