PCRL: Priority Convention Reinforcement Learning for Microscopically Sequenceable Multi-agent Problems
Keywords: reinforcement learning, social convention, priority, large discrete action space, multi-agent, cooperation
TL;DR: We propose priority convention reinforcement learning (PCRL) to tackle microscopically sequenceable multi-agent problems that may involve up to about 6.8e10 candidate actions at each decision step.
Abstract: Reinforcement learning (RL) has played an important role in tackling decision problems arising in multi-agent domains. However, RL still struggles with multi-agent large-discrete-action-space (LDAS) problems, which typically arise when the number of agents is large: at each decision step, the system faces an unaffordable number of candidate joint actions. Existing work has mainly addressed this challenge with indirect approaches such as continuous relaxation and sub-sampling, which may lack solution-quality guarantees when mapping the relaxed solution back to the discrete space. In this work, we propose to embed agreed priority conventions into reinforcement learning (PCRL) to directly tackle microscopically sequenceable multi-agent LDAS problems. The priority conventions consist of a position-based agent priority that breaks symmetries and a prescribed action priority that breaks ties. In a microscopically sequenceable multi-agent problem, the centralized planner generates, at each decision step of the whole system, an action vector whose components (one per agent, each produced in a micro-step) respect these conventions. Viewed microscopically, the action vector is generated sequentially; this generation never misses the optimal action vector and helps RL exploit around the lexicographically smallest optimal action vector. Dedicated learning and action-selection schemes are designed to realize this embedding. The effectiveness and superiority of PCRL are validated by experiments on multi-agent applications, including multi-agent complete coverage planning (involving up to $4^{18}>6.8\times 10^{10}$ candidate actions at each decision step) and the cooperative pong game (state-based and pixel-based variants), demonstrating PCRL's ability to handle LDAS problems and its higher optimality-finding ability compared with joint-action RL methods and heuristic algorithms.
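The sketch below is a minimal, hypothetical illustration (not the authors' exact algorithm) of how the abstract's micro-step action selection could look: agents are visited in a position-based priority order, each micro-step picks an action from a learned value estimate, and ties are broken by a prescribed action priority so the selected joint action drifts toward the lexicographically smallest optimum. The function names and signatures are assumptions for illustration only.

```python
import numpy as np

def select_action_vector(q_fn, state, agent_priority, action_priority):
    """Sequentially build one joint-action vector, one agent per micro-step.

    q_fn(state, partial_actions, agent_id, action) -> float value estimate
    agent_priority: agent ids sorted by the position-based convention
    action_priority: candidate actions sorted by the prescribed convention
    All names here are illustrative placeholders.
    """
    partial = []                        # actions chosen so far, one per micro-step
    for agent_id in agent_priority:     # symmetry broken by the agent-priority order
        best_a, best_v = None, -np.inf
        for a in action_priority:       # ties broken by the action-priority order
            v = q_fn(state, tuple(partial), agent_id, a)
            if v > best_v:              # strict '>' keeps the earlier (higher-priority) action on ties
                best_a, best_v = a, v
        partial.append(best_a)
    return tuple(partial)               # joint action vector for this decision step
```

Under these assumptions, the loop enumerates agents rather than the full joint-action space, so the per-step cost grows with the number of agents times the per-agent action count instead of with the exponential number of candidate joint actions.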
Supplementary Material: zip