Keywords: Multi-agent reinforcement learning, Autonomous Driving, Population-based Training, Zero-shot Coordination
TL;DR: ReCord trains autonomous driving policies against replayed, non-reactive partner trajectories, which improves zero-shot coordination with long-tail partners in both matrix-game and multi-agent driving experiments.
Abstract: Autonomous driving policies trained with self-play reinforcement learning (RL) can generalize to unseen scenarios, but they are trained primarily through interactions among copies of the same policy. As a result, they may be unprepared for diverse and unfamiliar partner behaviors. This gap is safety-critical in autonomous driving, where other agents can be aggressive, non-reactive, or otherwise different from those seen during training.
Population-based training (PBT) addresses this limitation by training the ego policy with diverse pre-trained partners. However, conventional PBT typically executes partner policies online during ego training, so the partners react to the ego policy and the ego can come to rely on them adapting or yielding. We refer to this standard setting as reactive-PBT.
To avoid training against reactive partners, we propose Replay Coordination (ReCord), which trains the ego policy on fixed trajectories replayed from a diverse partner population. By removing online partner adaptation, ReCord encourages robust coordination that does not rely on partners' yielding behavior. In both a matrix game and a multi-agent driving simulator, ReCord outperforms reactive-PBT, especially against non-reactive or weakly reactive partners, including replayed human trajectories, while remaining competitive under reactive evaluation.
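The difference between a reactive partner and a replayed one can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2x2 payoff table, the `ReactivePartner` and `ReplayPartner` classes, and the fixed ego policies are all hypothetical. The sketch only shows why an ego behavior that looks acceptable against a partner that yields online can score much worse once the same partner's trajectory is recorded and replayed without reacting.

```python
# Hypothetical 2x2 "intersection" game. Actions: 0 = yield, 1 = go.
# Keys are (ego_action, partner_action); values are the ego's payoff.
EGO_PAYOFF = {(0, 0): -1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): -10.0}


class ReactivePartner:
    """Runs online: goes by default, but yields if the ego went on the previous step."""

    def __init__(self):
        self.last_ego_action = 0

    def act(self, t):
        return 0 if self.last_ego_action == 1 else 1

    def observe(self, ego_action):
        self.last_ego_action = ego_action


class ReplayPartner:
    """Replays a fixed, pre-recorded action sequence and ignores the ego."""

    def __init__(self, trajectory):
        self.trajectory = list(trajectory)

    def act(self, t):
        return self.trajectory[t % len(self.trajectory)]

    def observe(self, ego_action):
        pass  # non-reactive: the ego's behavior has no effect


def rollout(ego_policy, partner, horizon):
    """Return the ego's total payoff and the partner's action sequence."""
    total, partner_actions = 0.0, []
    for t in range(horizon):
        a_partner = partner.act(t)
        a_ego = ego_policy(t)
        total += EGO_PAYOFF[(a_ego, a_partner)]
        partner.observe(a_ego)
        partner_actions.append(a_partner)
    return total, partner_actions


def always_yield(t):
    return 0


def always_go(t):
    return 1


HORIZON = 5

# Step 1: record a partner trajectory once, before ego training.
_, recorded = rollout(always_yield, ReactivePartner(), HORIZON)

# Step 2: evaluate an aggressive ego against the online partner and the replay.
reactive_return, _ = rollout(always_go, ReactivePartner(), HORIZON)
replay_return, _ = rollout(always_go, ReplayPartner(recorded), HORIZON)

print("recorded partner actions:", recorded)
print("aggressive ego vs reactive partner:", reactive_return)
print("aggressive ego vs replayed partner:", replay_return)
```

In this toy setting the online partner starts yielding after the first conflict, so the aggressive ego is penalized only once, whereas the replayed partner never yields and the same ego is penalized at every step. A real driving setup would replay full state trajectories in a simulator rather than a list of discrete actions.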
Submission Number: 66