Recursive Reasoning for Sample-Efficient Multi-Agent Reinforcement Learning

ICLR 2026 Conference Submission 21973 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: reinforcement learning, multi-agent systems, cooperative learning, policy gradients
TL;DR: We present a theoretically grounded recursive reasoning framework that enhances cooperation in multi-agent reinforcement learning for both on-policy and off-policy algorithms.
Abstract: Policy gradient algorithms for deep multi-agent reinforcement learning (MARL) typically employ an update that responds to the current strategies of other agents. While straightforward, this approach does not account for the updates of other agents within the same update step, resulting in miscoordination and reduced sample efficiency. In this paper, we introduce methods that recursively refine the policy gradient by updating each agent against the updated policies of the other agents within the same update step, speeding up the discovery of effective coordinated policies. We provide principled implementations of recursive reasoning in MARL by applying it to competitive multi-agent algorithms in both on-policy and off-policy regimes. Empirically, we demonstrate superior performance and sample efficiency over existing deep MARL algorithms in StarCraft II and multi-agent MuJoCo. Theoretically, we prove that higher-order recursive reasoning in gradient-based methods with finite iterates achieves monotonic convergence to a local Nash equilibrium under certain conditions.
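As a rough illustration of the update scheme described in the abstract, below is a minimal PyTorch sketch (a hypothetical construction, not the authors' implementation): each reasoning level re-derives every agent's gradient against the other agents' provisionally updated parameters from the previous level, so depth 1 recovers the standard simultaneous update and deeper recursion responds to the others' anticipated updates. The function name `recursive_update` and the toy objective are assumptions for illustration only.

```python
import torch

def recursive_update(params, objectives, lr=0.01, depth=2):
    """Hypothetical sketch of a depth-`depth` recursive-reasoning update.

    params:     list of leaf tensors (requires_grad=True), one per agent
    objectives: list of callables; objectives[i](joint_params) returns
                agent i's scalar objective to be maximized
    """
    # Level 0: the other agents' current (un-updated) parameters.
    level = [p.detach() for p in params]
    for _ in range(depth):
        next_level = []
        for i, p in enumerate(params):
            # Agent i evaluates its objective with the *other* agents'
            # parameters taken from the previous reasoning level.
            joint = [p if j == i else level[j] for j in range(len(params))]
            obj = objectives[i](joint)
            (grad,) = torch.autograd.grad(obj, p)
            # Gradient ascent step from the base parameters, against the
            # others' refined policies from the previous level.
            next_level.append((p + lr * grad).detach())
        level = next_level
    return level

# Toy usage (assumed setup): two agents with a shared cooperative objective
# that rewards matching each other's parameters.
theta = [torch.randn(2, requires_grad=True) for _ in range(2)]
team_reward = lambda ps: -((ps[0] - ps[1]) ** 2).sum()
new_theta = recursive_update(theta, [team_reward, team_reward], lr=0.1, depth=3)
```

In this sketch each level rebuilds the step from the agents' base parameters rather than compounding steps, matching the "within the same update step" framing; how the paper actually interleaves refinement with optimizer state is not specified here.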
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 21973