Learning to attend and reorder: Scalable policy optimization in large-scale multi-agent systems

Zhaohan Feng, Wei Xiao, Jian Sun, Jie Chen, Gang Wang

Published: 27 Mar 2026, Last Modified: 26 Jan 2026 | Neurocomputing | Everyone | CC BY 4.0

Abstract: Scalability is a central challenge in multi-agent reinforcement learning (MARL), as real-world applications often require coordination among tens to hundreds of agents. As multi-agent systems (MAS) scale up, their inherent difficulties (partial observability, non-stationarity, and complex inter-agent dependencies) become increasingly pronounced. Existing approaches typically pursue scalability by encouraging grouped or hierarchical cooperation, but they rely on task-specific priors such as predefined role structures, fixed sub-task horizons, or perceivable sub-task boundaries. This limited flexibility makes their performance heavily dependent on carefully hand-crafted designs and restricts their effectiveness in large-scale MAS. To address these limitations, we propose Selective Attention–enhanced Multi-agent Policy Optimization (SAMPO), a concise yet effective framework for scalable multi-agent policy learning. SAMPO uses attention scores to reorder each agent’s observations, which achieves permutation invariance in a simple manner and thereby reduces the complexity of the observation space. This design substantially improves learning efficiency in cooperative tasks involving up to hundreds of agents. Moreover, SAMPO introduces a selection mechanism [5], i.e., a module that adaptively selects which interactions or entities to focus on, into the attention computation. This mechanism dynamically determines the attention parameter matrices based on each agent’s internal state, thereby injecting nonlinearity and greatly enhancing the expressive capacity of attention encoding. By virtue of these designs, SAMPO eliminates the need for extensive manual tuning and hand-crafted coordination structures and performs strongly in large-scale multi-agent tasks. Empirical results show that, under a unified set of hyperparameters, SAMPO consistently outperforms state-of-the-art baselines across SMAC environments of varying scales, including those involving up to hundreds of agents.
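
The abstract does not specify the architecture, so the following is only a minimal NumPy sketch of the two ideas it describes: reordering an agent's observed entities by attention score, and generating the attention projection matrices from the agent's internal state. All names here (`attend_and_reorder`, `state_conditioned_matrix`, `G_q`, `G_k`, the `tanh` generator, and the dimensions) are illustrative assumptions and not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def state_conditioned_matrix(h, G, rows, cols):
    """Hypothetical selection step: build a projection matrix from the
    internal state h, so the attention parameters change with the state."""
    return np.tanh(G @ h).reshape(rows, cols)


def attend_and_reorder(own, entities, h, params):
    """Score each observed entity with attention, then sort entities by score.

    own:      (own_dim,) the agent's own features
    entities: (n, ent_dim) features of observed entities, in arbitrary order
    h:        (h_dim,) the agent's internal (e.g. recurrent) state
    """
    d_k = params["d_k"]
    W_q = state_conditioned_matrix(h, params["G_q"], own.shape[0], d_k)
    W_k = state_conditioned_matrix(h, params["G_k"], entities.shape[1], d_k)
    q = own @ W_q                      # query, shape (d_k,)
    K = entities @ W_k                 # keys, shape (n, d_k)
    scores = softmax(K @ q / np.sqrt(d_k))
    order = np.argsort(-scores, kind="stable")  # highest score first
    return entities[order], scores[order]


def make_params(rng, own_dim, ent_dim, h_dim, d_k):
    """Random generator weights, standing in for learned parameters."""
    return {
        "d_k": d_k,
        "G_q": rng.normal(size=(own_dim * d_k, h_dim)),
        "G_k": rng.normal(size=(ent_dim * d_k, h_dim)),
    }


rng = np.random.default_rng(0)
params = make_params(rng, own_dim=4, ent_dim=6, h_dim=8, d_k=5)
own = rng.normal(size=4)
entities = rng.normal(size=(7, 6))
h = rng.normal(size=8)

sorted_entities, sorted_scores = attend_and_reorder(own, entities, h, params)

# Present the same entities in a different order: the reordered output
# should match, since each score depends only on the entity's own features.
shuffled = entities[rng.permutation(len(entities))]
sorted_again, _ = attend_and_reorder(own, shuffled, h, params)
print(np.allclose(sorted_entities, sorted_again))
```

In this sketch, permutation invariance comes from the sort: a downstream policy network always receives entities ranked by score, regardless of the order in which the environment lists them. Ties between scores would break this guarantee, which is unlikely with continuous features but would need handling in practice.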