Multi-Task Sequence Models Generalize in Offline Multi-Agent Reinforcement Learning

ICLR 2026 Conference Submission 19096 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multi-agent reinforcement learning, reinforcement learning, offline reinforcement learning
TL;DR: In offline MARL, increasing task diversity is far more important for generalisation than scaling dataset size. Multi-task models trained with our proposed design choices achieve substantially better zero-shot transfer than single-task models.
Abstract: Recent sequence model architectures have demonstrated great promise in offline multi-agent reinforcement learning (MARL). However, even for this expressive model class, generalising to tasks unseen in the training data remains a core challenge. A sensible response to this challenge is to simply scale the amount of offline data available for training. Yet, in this work, we find that task diversity has a stronger influence on generalisation than sheer dataset size. To obtain our findings, we study offline MARL sequence models trained on single-task datasets, clearly demonstrating their limited ability to zero-shot transfer to held-out test tasks. Leveraging this insight, we train and test multi-task versions of offline sequence modelling architectures. We identify three key design choices for successful offline multi-task training: (i) task-balanced mini-batches, (ii) treating value estimation as classification and (iii) agent masking to handle variable team sizes. Using multi-task datasets from three challenging cooperative environments (Connector, RWARE, and LBF), we investigate generalisation to unseen tasks and the scaling behaviour of our multi-task offline algorithms. We show that our multi-task sequence models generalise better than single-task models across all environments, achieving a mean improvement of 219% on held-out test tasks. Moreover, our offline MARL sequence models consistently outperform behaviour cloning (a surprisingly strong baseline). Our results clearly show that scaling task diversity by increasing the number of tasks used during training yields greater generalisation gains than simply scaling the dataset size at a fixed level of task diversity.
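The three design choices above are stated at a high level only. As a rough illustration, the snippet below sketches what design choice (i), task-balanced mini-batch sampling, could look like. This is a minimal, assumed Python sketch: the per-task buffers, the toy task names, and the task_balanced_batch helper are illustrative and are not taken from the paper.

import random

def task_balanced_batch(buffers, batch_size):
    """Sample a mini-batch with roughly equal numbers of trajectories per task,
    regardless of how many trajectories each task contributes to the dataset.
    (Illustrative sketch only; not the paper's exact sampling scheme.)"""
    task_ids = list(buffers.keys())
    per_task = max(1, batch_size // len(task_ids))
    batch = []
    for task_id in task_ids:
        # Sample with replacement so small per-task datasets are never exhausted.
        batch.extend(random.choices(buffers[task_id], k=per_task))
    random.shuffle(batch)
    return batch[:batch_size]

# Toy usage: per-task datasets of very different sizes still contribute
# comparably to every mini-batch.
buffers = {
    "lbf-8x8-2p": [f"traj_a{i}" for i in range(1000)],
    "lbf-15x15-3p": [f"traj_b{i}" for i in range(50)],
    "rware-tiny-2ag": [f"traj_c{i}" for i in range(500)],
}
batch = task_balanced_batch(buffers, batch_size=12)

The point of such balancing is that, without it, tasks with larger offline datasets would dominate the gradient signal and erode the task diversity that the abstract identifies as the main driver of generalisation.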
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 19096