Balancing Extremes: Exploiting the Performance Spectrum from Best to Worst in Multi-Agent Systems

ICLR 2026 Conference Submission 22042 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-agent Reinforcement Learning, Cooperative MARL, Reward Shaping, Optimism, Pessimism, Stability Regularization
TL;DR: Balancing Extremes in Multi-Agent Systems (BEMAS): a decentralized, proximity-aware MARL framework that uses optimism and pessimism shaping signals to encourage exploration, pull agents toward strong peers, and push them away from weak behaviors.
Abstract: Coordinating exploration and avoiding suboptimal equilibria remain central challenges in cooperative multi-agent reinforcement learning (MARL). We introduce BEMAS (Balancing Extremes in Multi-Agent Systems), a decentralized, proximity-aware framework that exploits the performance spectrum that naturally emerges as agents learn at different rates. During training, agents exchange bounded local messages to identify their best and worst neighbors via phase-aware TD-error scoring: a curiosity score encourages coordinated exploration and a performance score guides exploitation. BEMAS couples two shaped signals: (i) optimism, an intrinsic bonus equal to the optimistic action-value gap with respect to the best neighbor; and (ii) pessimism, a relative-entropy-based repulsion that discourages imitation of the worst neighbor. A schedule down-weights optimism and up-weights pessimism over training, and execution is fully decentralized with no communication. We establish boundedness of the shaping terms and add a Bayesian stability regularizer that limits policy surprise, yielding stable updates. On a standard cooperative MARL benchmark, BEMAS outperforms baselines, with ablations isolating the contributions of optimism and pessimism. Motivated by group learning theory, the framework provides a simple mechanism that pulls agents toward their best peers and repels weak behaviors.
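As a rough illustration of the two shaping signals the abstract describes (the exact formulation is not given on this page, so every symbol below is an assumption), one plausible reading is that agent $i$'s reward $r_i^t$ is augmented as

$\tilde{r}_i^t = r_i^t + \beta_t \big( \max_{a} Q_{b_i}(s_i^t, a) - Q_i(s_i^t, a_i^t) \big) + \lambda_t \, D_{\mathrm{KL}}\big( \pi_i(\cdot \mid s_i^t) \,\|\, \pi_{w_i}(\cdot \mid s_i^t) \big),$

where $b_i$ and $w_i$ denote agent $i$'s best and worst neighbors under the phase-aware TD-error scores, the first shaped term is the optimistic action-value gap (optimism), the second is the relative-entropy repulsion from the worst neighbor (pessimism), and the schedule drives $\beta_t$ down and $\lambda_t$ up over training. On this reading, the Bayesian stability regularizer would additionally penalize large $D_{\mathrm{KL}}(\pi_i^{\text{new}} \,\|\, \pi_i^{\text{old}})$ between successive policies to limit policy surprise.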
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22042