Abstract: Despite rapid advances in multimodal agents built on large foundation models, their potential for language-based communication between agents in collaborative tasks has largely been overlooked. This oversight leaves a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, especially in scenarios where agents have unequal access to information and must work together to accomplish tasks beyond the scope of any individual agent. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models such as GPT-4o and reasoning models such as o4-mini. Many long chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, struggle to outperform even a simple random-agent baseline in agent-agent collaboration, indicating substantial room for growth in their communication abilities.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Blake_Aaron_Richards1
Submission Number: 5005