Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

12 Oct 2021, 19:37 | Deep RL Workshop NeurIPS 2021 | Readers: Everyone
Keywords: Multi-Agent Reinforcement Learning (MARL), Offline reinforcement learning (RL), Offline Multi-Agent Reinforcement Learning
TL;DR: We propose a simple yet effective OMAR algorithm to tackle the offline MARL setting via a combination of first-order policy gradients and zeroth-order optimization methods, which achieves SOTA performance in multi-agent continuous control benchmarks.
Abstract: The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, how to solve offline RL in the more practical multi-agent setting remains an open question, as many real-world scenarios involve interaction among multiple agents. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue: the landscape of the value function can be non-concave, and policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy of any single agent could lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), to tackle this critical challenge via an effective combination of first-order policy gradient and zeroth-order optimization methods for the actor to better optimize the conservative value function. Despite its simplicity, OMAR significantly outperforms strong baselines and achieves state-of-the-art performance in multi-agent continuous control benchmarks.
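
The intuition in the abstract (gradient-based actor updates can stall at a local optimum of a non-concave value landscape, while a sampling-based search can escape it) can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D value function `q_value`, the helpers `gradient_ascent`, `sampling_search`, and `rectify`, and all hyperparameters are hypothetical choices made only for illustration, and the hard "keep the better action" rule is a simplified stand-in for however the method actually combines the two signals.

```python
import numpy as np

def q_value(a):
    """Hypothetical non-concave value landscape over a 1-D action in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    local_bump = 0.5 * np.exp(-((a + 0.5) ** 2) / 0.02)   # local optimum near a = -0.5
    global_bump = 1.0 * np.exp(-((a - 0.6) ** 2) / 0.05)  # global optimum near a = 0.6
    return local_bump + global_bump

def gradient_ascent(a0, lr=0.01, steps=200, eps=1e-4):
    """First-order update: follow a finite-difference gradient of q_value."""
    a = float(a0)
    for _ in range(steps):
        grad = (q_value(a + eps) - q_value(a - eps)) / (2 * eps)
        a = float(np.clip(a + lr * grad, -1.0, 1.0))
    return a

def sampling_search(rng, mu=0.0, sigma=1.0, n_samples=128, n_elite=16, iters=5):
    """Zeroth-order search: sample actions, refit a Gaussian to the best ones."""
    best_a, best_q = mu, float(q_value(mu))
    for _ in range(iters):
        cand = np.clip(rng.normal(mu, sigma, size=n_samples), -1.0, 1.0)
        q = q_value(cand)
        elite = cand[np.argsort(q)[-n_elite:]]            # highest-value samples
        mu, sigma = float(elite.mean()), max(float(elite.std()), 1e-3)
        i = int(np.argmax(q))
        if q[i] > best_q:                                 # track the best action seen
            best_a, best_q = float(cand[i]), float(q[i])
    return best_a

def rectify(a_grad, a_sampled):
    """Toy rectification: keep whichever candidate action has the higher value."""
    return a_sampled if q_value(a_sampled) > q_value(a_grad) else a_grad

rng = np.random.default_rng(0)
a_grad = gradient_ascent(a0=-0.4)   # initialized in the basin of the local optimum
a_zo = sampling_search(rng)
a_final = rectify(a_grad, a_zo)
print(f"gradient-only action: {a_grad:.3f}, Q = {float(q_value(a_grad)):.3f}")
print(f"rectified action:     {a_final:.3f}, Q = {float(q_value(a_final)):.3f}")
```

In this toy setup the gradient-only action stays in the basin it was initialized in, whereas the sampled candidates cover the whole action range and the rectified action ends up near the higher peak. With several agents, each one stuck in such a basin would compound the coordination problem the abstract describes.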
Supplementary Material: zip