Multiagent Reinforcement Learning: Rollout and Policy Iteration for POMDP With Application to Multirobot Problems

Published: 01 Jan 2024, Last Modified: 20 Jan 2025 · IEEE Trans. Robotics 2024 · CC BY-SA 4.0
Abstract: In this article, we consider the computational and communication challenges of partially observable multiagent sequential decision-making problems. We present algorithms that simultaneously or sequentially optimize the agents' controls by using multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. In particular: 1) we consider multiagent rollout algorithms that dramatically reduce the required computation while preserving the key policy improvement property of the standard rollout method. We improve the multiagent rollout policy by incorporating it in an offline approximate policy iteration scheme, and we apply an additional "online play" scheme that enhances the offline approximation architecture; 2) we consider the case of imperfect interagent communication and provide extensions of our rollout methods that handle it; and 3) we evaluate our methods in extensive simulations of a challenging partially observable multiagent sequential repair problem (state space of size $10^{37}$ and control space of size $10^{7}$). These simulations show that our methods produce better policies than existing approaches, including POMCP and MADDPG, on large and complex multiagent problems, and continue to work where those methods fail to scale.
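The computational saving of multiagent rollout comes from optimizing the agents' controls one at a time, with the remaining agents held to the base policy, instead of searching the joint control space. The sketch below is a minimal illustration of such a one-agent-at-a-time rollout step, not the paper's implementation: the simulator interface (`step`), `base_policy`, `terminal_cost`, and candidate control sets are hypothetical placeholders, and Q-factors are estimated by truncated simulation of the base policy plus a terminal cost approximation, as the abstract describes.

```python
# One-agent-at-a-time multiagent rollout (illustrative sketch, hypothetical interfaces).
# Joint optimization over m agents would require |U|^m candidate controls per stage;
# sequential optimization reduces this to roughly m * |U| while preserving the
# rollout policy improvement property.
from typing import Callable, List, Sequence


def multiagent_rollout_step(
    state,
    agent_controls: Sequence[Sequence[int]],        # candidate controls for each agent
    base_policy: Callable[[object], List[int]],     # base policy: state -> joint control
    step: Callable[[object, List[int]], tuple],     # simulator: (state, joint control) -> (next state, stage cost)
    terminal_cost: Callable[[object], float],       # terminal cost function approximation
    horizon: int = 10,
    num_rollouts: int = 20,
) -> List[int]:
    """Choose a joint control one agent at a time, using truncated rollout Q-factor estimates."""

    def rollout_value(s) -> float:
        # Truncated rollout: simulate the base policy for `horizon` steps,
        # add the terminal cost approximation, and average over runs.
        total = 0.0
        for _ in range(num_rollouts):
            sim_s, acc = s, 0.0
            for _ in range(horizon):
                sim_s, c = step(sim_s, base_policy(sim_s))
                acc += c
            total += acc + terminal_cost(sim_s)
        return total / num_rollouts

    joint = list(base_policy(state))                 # start from the base policy's joint control
    for i, candidates in enumerate(agent_controls):  # optimize agents sequentially
        best_u, best_q = joint[i], float("inf")
        for u in candidates:
            trial = joint[:i] + [u] + joint[i + 1:]  # agents after i still follow the base policy
            next_state, stage_cost = step(state, trial)
            q = stage_cost + rollout_value(next_state)
            if q < best_q:
                best_q, best_u = q, u
        joint[i] = best_u                            # fix agent i's control before moving on
    return joint


if __name__ == "__main__":
    # Toy demo (not from the paper): two agents on the integer line try to reach
    # the origin; the stage cost is the total distance from the origin.
    def step(s, u):
        ns = tuple(p + a for p, a in zip(s, u))
        return ns, float(sum(abs(p) for p in ns))

    base = lambda s: [0, 0]                          # base policy: stay put
    u = multiagent_rollout_step(
        state=(3, -2),
        agent_controls=[(-1, 0, 1), (-1, 0, 1)],
        base_policy=base,
        step=step,
        terminal_cost=lambda s: float(sum(abs(p) for p in s)),
    )
    print("rollout joint control:", u)               # both agents move toward the origin
```

In the sketch, each agent's candidate controls are evaluated while the already-optimized agents keep their chosen controls and the not-yet-optimized agents follow the base policy, which is the mechanism that keeps the per-stage computation linear, rather than exponential, in the number of agents.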