Abstract: Proximal Policy Optimization (PPO) has achieved empirical success in single-agent reinforcement learning thanks to its guaranteed monotonic improvement. This theoretical support makes its extension to multi-agent systems very attractive. However, existing PPO-based algorithms in cooperative multi-agent reinforcement learning (MARL) either lack the theoretical guarantee of monotonic improvement or impose inevitably restrictive settings, which greatly limits their applicable scenarios. In this paper, we propose a theoretically justified and general multi-agent PPO algorithm for cooperative MARL called Full-Pipeline PPO (FP3O). The core idea of FP3O is to dynamically allocate agents to different optimization pipelines and perform the proposed one-separation trust region optimization for each pipeline. We theoretically prove the monotonicity of joint policy improvement under the policy iteration procedure of FP3O. In addition, FP3O enjoys high generality since it avoids the restrictive factors that arise in other existing PPO-based algorithms. In our experiments, FP3O outperforms other strong baselines on the Multi-Agent MuJoCo and StarCraft II Multi-Agent Challenge benchmarks and also demonstrates its generality across common network types (i.e., full parameter sharing, partial parameter sharing, and non-parameter sharing) and various multi-agent tasks.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)
Supplementary Material: zip