Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Juan Claude Formanek; Callum Rhys Tilbury; Louise Beyers; Jonathan Phillip Shock; Arnu Pretorius

Published: 26 Sept 2024, Last Modified: 13 Nov 2024, NeurIPS 2024 Track Datasets and Benchmarks Poster, CC BY-NC 4.0

Keywords: Offline Multi-Agent Reinforcement Learning, Multi-Agent Reinforcement Learning, Offline Reinforcement Learning, Reinforcement Learning

TL;DR: We highlight several issues in offline MARL, show that simple well-implemented baselines can produce SOTA results, and propose standards for evaluation to improve future work.

Abstract: Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and build upon prior work. In this paper, we first identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Second, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted in this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which, in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.

Submission Number: 1211
