Keywords: multi-agent cooperation, reinforcement learning algorithm
Abstract: Due to the limited representational capacity of the joint Q value function, multi-agent reinforcement learning (MARL) methods with linear or monotonic value decomposition cannot guarantee optimal consistency (i.e. the correspondence between the individual greedy actions and the maximal true Q value), leading to instability and poor coordination. Existing methods address this representational limitation by learning a fully expressive joint value function, which is impractical and may degrade performance in complex tasks. In this paper, we introduce the True-Global-Max (TGM) condition for linear and monotonic value decomposition to achieve optimal consistency directly; the TGM condition is satisfied when the optimal greedy action is the only stable greedy action. Accordingly, we propose the greedy-based value representation (GVR), which stabilises the optimal greedy action via inferior target shaping and destabilises non-optimal greedy actions via superior experience replay. We conduct experiments on various benchmarks, where GVR significantly outperforms state-of-the-art baselines. The results demonstrate that our method satisfies optimal consistency under sufficient exploration and is more efficient than methods with full expressive capacity.
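As a minimal sketch of the optimal-consistency requirement described above (notation assumed for illustration and not necessarily the paper's): with individual utilities $Q_i(\tau_i, u_i)$ and the true joint value $Q^{*}(s, \mathbf{u})$, optimal consistency asks that the joint greedy action formed from the individual value functions also maximise the true Q value, i.e. $\big(\arg\max_{u_1} Q_1(\tau_1, u_1), \ldots, \arg\max_{u_n} Q_n(\tau_n, u_n)\big) \in \arg\max_{\mathbf{u}} Q^{*}(s, \mathbf{u})$.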