CoSMAC: A Benchmark for Evaluating Communication and Coordination in LLM-Based Agents

ACL ARR 2026 January Submission10386 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multiagent Systems, Large Language Models, Natural Language Communication, Computation and Language, Machine Learning
Abstract: Large Language Models (LLMs) have recently demonstrated strong reasoning and communication abilities, motivating research into their potential as autonomous agents in multi-agent systems. In this work, we introduce Communicative SMAC (CoSMAC), a benchmark designed to systematically evaluate the communication and coordination capabilities of LLM-based agents. Built upon the well-established SMAC multi-agent reinforcement learning (MARL) environment, CoSMAC features a set of scenarios requiring varying degrees of micromanagement and communication, in which agents must exchange information through natural language to achieve shared goals. We evaluate eight state-of-the-art open-source and proprietary LLMs in zero-shot settings, analyzing model properties that are critical for communicative and cooperative behaviors. Based on these results, we then distill the Qwen2.5-7B model on the resulting dataset via supervised fine-tuning. We further compare the performance of LLM-based agents against a well-known MARL baseline trained without communication. Experimental results show that while LLMs struggle in scenarios demanding fine-grained micromanagement and spatial coordination, they can outperform the MARL baseline in tasks that rely more heavily on effective communication. The environment implementation and datasets are available in a GitHub repository.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: benchmarking
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 10386