Abstract: Despite rapid advances in multimodal agents built on large foundation models, their potential for language-based communication between agents in collaborative tasks has largely been overlooked. This oversight leaves a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, especially in scenarios where agents have unequal access to information and must work together to accomplish tasks beyond the scope of any individual agent. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models such as GPT-4o and reasoning models such as o4-mini. Many long chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, struggle to outperform even a simple random-agent baseline in agent-agent collaboration, indicating substantial room for growth in their communication abilities.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Blake_Aaron_Richards1
Submission Number: 5005