Although trust region policy optimization methods have achieved considerable success in cooperative multi-agent tasks, most of them suffer from a non-stationarity problem during learning. Recently, sequential trust region methods that update policies agent by agent have shed light on alleviating this non-stationarity. However, these methods remain less sample-efficient than their single-agent counterparts (e.g., PPO). To narrow this efficiency gap, we propose the Off-Policyness-aware Sequential Policy Optimization (OPSPO) method, which explicitly manages the off-policyness that arises from the sequential policy update process among multiple agents. We prove that OPSPO achieves a tighter monotonic improvement bound than other trust region multi-agent learning methods. Finally, we demonstrate that OPSPO consistently outperforms strong baselines on challenging multi-agent benchmarks, including StarCraft II micromanagement tasks, Multi-Agent MuJoCo, and Google Research Football full-game scenarios.
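To make the sequential update scheme described above concrete, the sketch below illustrates (under our own assumptions, not the authors' implementation) how policies might be updated agent by agent with a clipped surrogate, where the importance ratios of agents updated earlier in the sequence are folded into later agents' objectives; the deviation of that accumulated ratio from 1 is one simple proxy for the off-policyness that OPSPO is said to manage. All names and parameters here (`sequential_clipped_update`, `clip_eps`, `offpolicy weight` via `prior_ratio`) are hypothetical.

```python
# Minimal illustrative sketch of a sequential, agent-by-agent clipped
# trust-region update. NOT the OPSPO algorithm; a hedged approximation
# of the general scheme described in the abstract.
import torch


def sequential_clipped_update(policies, optimizers, obs, actions, advantages,
                              old_log_probs, clip_eps=0.2, epochs=4):
    """Update each agent's policy in sequence, reweighting later agents'
    surrogates by the importance ratio of agents already updated."""
    n_agents = len(policies)
    # Importance ratio contributed by previously updated agents (starts at 1).
    prior_ratio = torch.ones_like(advantages)

    for i in range(n_agents):
        pi, opt = policies[i], optimizers[i]
        for _ in range(epochs):
            dist = torch.distributions.Categorical(logits=pi(obs[i]))
            new_log_prob = dist.log_prob(actions[i])
            ratio = torch.exp(new_log_prob - old_log_probs[i])
            # Surrogate weighted by the (detached) prior agents' ratio;
            # its drift away from 1 reflects accumulated off-policyness.
            weighted_adv = prior_ratio.detach() * advantages
            surr1 = ratio * weighted_adv
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * weighted_adv
            loss = -torch.min(surr1, surr2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Fold agent i's final ratio into the prior ratio for later agents.
        with torch.no_grad():
            dist = torch.distributions.Categorical(logits=pi(obs[i]))
            prior_ratio = prior_ratio * torch.exp(
                dist.log_prob(actions[i]) - old_log_probs[i])

    # How far prior_ratio strays from 1 indicates accumulated off-policyness.
    return prior_ratio
```

In this sketch, an off-policyness-aware method would presumably monitor or constrain `prior_ratio` rather than leaving it unchecked; how OPSPO actually does so is specified in the paper itself, not here.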