DLM: A Scalable Decision Language Model for Multi-Agent Sequential Decision in SMAC Tasks

ICLR 2026 Conference Submission 15549 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Offline Multi-Agent Reinforcement Learning, Decision Language Model, Large Language Models, Group Relative Policy Optimization
TL;DR: We propose the Decision Language Model (DLM), a scalable framework for offline multi-agent sequential decision-making that reformulates decisions as dialogue-style sequence modeling and achieves state-of-the-art zero-shot generalization across tasks.
Abstract: Building a scalable model from offline datasets to tackle a broad spectrum of multi-agent sequential decision-making problems across tasks is a crucial step toward reusable and generalizable decision intelligence. However, mainstream offline multi-agent reinforcement learning (MARL) methods generalize poorly because they rely on fixed observation formats and action spaces. In contrast, language models offer flexible input representations that are not constrained by predefined dimensions. Motivated by this, we propose the Decision Language Model (DLM), a framework that formulates decision-making as a dialogue-style sequence prediction problem. DLM is trained in two stages: a supervised fine-tuning (SFT) phase that leverages dialogue-style datasets to enable centralized training with inter-agent context, generating coordinated actions consistent with environment constraints; and a group relative policy optimization (GRPO) phase that further trains DLM-SFT with lightweight reward functions to improve robustness to out-of-distribution (OOD) actions, yielding DLM-GRPO. Despite its simple design, DLM-SFT matches the performance of leading offline MARL methods across all tasks on the SMAC benchmark using only observation and action data. DLM-GRPO further improves execution reliability by significantly reducing the risk of OOD actions and achieves strong zero-shot generalization to unseen tasks, reaching state-of-the-art performance with a single unified model.
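To make the GRPO stage concrete, below is a minimal, illustrative sketch (not the authors' code) of the two ideas the abstract describes: a lightweight reward that penalizes out-of-distribution actions, i.e. generated actions outside the environment's legal action set, and the group-relative advantage used by GRPO, where each sampled completion's reward is standardized against the other completions drawn for the same prompt. All names here (`lightweight_reward`, `group_relative_advantages`, the example action strings) are hypothetical placeholders.

```python
# Sketch of GRPO-style group-relative advantages with a lightweight OOD-action
# reward, assuming a dialogue-style prompt per agent and G sampled completions.
from statistics import mean, pstdev


def lightweight_reward(action: str, legal_actions: set[str]) -> float:
    """Assumed reward design: 1.0 for a legal action, 0.0 for an OOD action."""
    return 1.0 if action in legal_actions else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of G = 4 actions sampled for one agent's dialogue prompt.
    legal = {"move_north", "move_south", "attack_enemy_2", "stop"}
    sampled = ["attack_enemy_2", "move_north", "attack_enemy_9", "stop"]  # one OOD action
    rewards = [lightweight_reward(a, legal) for a in sampled]
    print(group_relative_advantages(rewards))  # the OOD sample receives a negative advantage
```

Under these assumptions, the advantages would weight a standard clipped policy-gradient update, pushing probability mass away from actions the environment cannot execute.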
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15549