Keywords: Multi-Agent Reinforcement Learning, Centralized Training Decentralized Execution
Abstract: We study Cooperative Multi-Agent Reinforcement Learning (MARL), where the aim is to train decentralized policies that maximize a shared return. Existing methods typically employ either iterative best-response updates, which converge only to Nash Equilibria (NE) that may be far from the global optimum, or simultaneous learning with centralized critics, which lacks convergence guarantees to the optimal joint policy without strong assumptions such as value-function decomposability.
We introduce the Agent-Chained Belief MDP (AC-BMDP), which reformulates MARL as a serialized decision process where agents act sequentially while maintaining beliefs over actions taken by preceding agents. This enables the definition of agent-specific value functions that are naturally chained together. Building on this framework, we propose Agent-Chained Policy Iteration (ACPI) and prove that it converges to the globally optimal joint policy.
We further develop this framework into a practical actor–critic algorithm, Agent-Chained Policy Optimization (ACPO). On standard benchmarks, ACPO consistently surpasses state-of-the-art baselines, with the performance advantage growing significantly as the number of agents increases.
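As a rough illustration of the serialized-execution idea described above, the hedged Python sketch below shows agents selecting actions in a fixed order, with each agent conditioning on its own observation and on the actions already chosen by preceding agents. This is a generic illustration only, not the paper's AC-BMDP, ACPI, or ACPO; all function and variable names (e.g., `policy`, `serialized_joint_action`) are hypothetical.

```python
import numpy as np

# Hypothetical sketch of serialized agent execution: agent i observes its own
# input plus the actions of agents 0..i-1. Not the paper's AC-BMDP formulation.
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

def policy(agent_id, observation, preceding_actions):
    """Placeholder policy: scores actions from the observation and the
    (possibly empty) tuple of actions chosen by earlier agents."""
    logits = rng.normal(size=n_actions)          # stand-in for a learned network
    logits += 0.1 * len(preceding_actions)       # earlier actions shift the scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

def serialized_joint_action(observations):
    """Build the joint action one agent at a time, in a fixed agent order."""
    chosen = []
    for i in range(n_agents):
        chosen.append(policy(i, observations[i], tuple(chosen)))
    return tuple(chosen)

observations = [rng.normal(size=8) for _ in range(n_agents)]
print(serialized_joint_action(observations))
```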
Primary Area: reinforcement learning
Submission Number: 23500