Keywords: Multi-Agent Reinforcement Learning, Centralized Training Decentralized Execution
Abstract: We study Cooperative Multi-Agent Reinforcement Learning (MARL), where the aim is to train decentralized policies that maximize a shared return. Existing methods typically employ either iterative best-response updates, which converge only to Nash Equilibria (NE) that may be far from the global optimum, or simultaneous learning with centralized critics, which lacks convergence guarantees to the optimal joint policy without strong assumptions such as value-function decomposability.
We introduce the Agent-Chained Belief MDP (AC-BMDP), which reformulates MARL as a serialized decision process where agents act sequentially while maintaining beliefs over actions taken by preceding agents. This enables the definition of agent-specific value functions that are naturally chained together. Building on this framework, we propose Agent-Chained Policy Iteration (ACPI) and prove that it converges to the globally optimal joint policy.
We further develop this framework into a practical actor–critic algorithm, Agent-Chained Policy Optimization (ACPO). On standard benchmarks, ACPO consistently surpasses state-of-the-art baselines, with the performance advantage growing significantly as the number of agents increases.
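As a rough illustration of the serialized-execution idea described above, the hedged Python sketch below shows agents selecting actions in a fixed order, with each agent conditioning on its own observation and on the actions already chosen by preceding agents. This is a generic illustration only, not the paper's AC-BMDP, ACPI, or ACPO; all function and variable names (e.g., `policy`, `serialized_joint_action`) are hypothetical.

```python
import numpy as np

# Hypothetical sketch of serialized agent execution: agent i observes its own
# input plus the actions of agents 0..i-1. Not the paper's AC-BMDP formulation.
rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

def policy(agent_id, observation, preceding_actions):
    """Placeholder policy: scores actions from the observation and the
    (possibly empty) tuple of actions chosen by earlier agents."""
    logits = rng.normal(size=n_actions)          # stand-in for a learned network
    logits += 0.1 * len(preceding_actions)       # earlier actions shift the scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))

def serialized_joint_action(observations):
    """Build the joint action one agent at a time, in a fixed agent order."""
    chosen = []
    for i in range(n_agents):
        chosen.append(policy(i, observations[i], tuple(chosen)))
    return tuple(chosen)

observations = [rng.normal(size=8) for _ in range(n_agents)]
print(serialized_joint_action(observations))
```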
Primary Area: reinforcement learning
Submission Number: 23500