Keywords: LLM Agent, Multi-Agent System, Cooperation, Coordination, Partner Modeling, Theory of Mind, Game-Theoretic Learning, Zero-Shot Cooperation, Strategic AI, Trace-Grounded Evaluation
TL;DR: We show that high task success in LLM-based multi-agent systems can hide poor strategic coordination, and introduce trace-grounded metrics to measure redundancy, delayed response, verifier dependence, and partner-specific fragility.
Abstract: Large language models are increasingly deployed as autonomous agents in multi-agent environments, but task completion alone may not indicate reliable strategic coordination. We study this issue in Collab-Overcooked, a cooperative benchmark where LLM agents coordinate through natural language under role asymmetry and verifier-guided execution. We model each rollout as the realized outcome of a finite-horizon cooperative Markov game and evaluate whether successful teams also exhibit efficient coordination: low redundancy, timely partner response, useful signaling, and robustness to partner identity. We extend Collab-Overcooked with non-required-cooperation layouts in which either agent can complete the task alone, allowing collaboration to be evaluated as an efficiency strategy rather than a feasibility requirement. We introduce a trace-grounded evaluation framework combining structured interdependence with Trace-Grounded Communication Outcome auditing. Across three LLM families and directed role pairings, we find that high task success often coexists with substantial coordination cost, including redundant actions, delayed or unfulfilled requests, verifier corrections, and strong role-pairing effects. Theory-of-Mind scaffolds improve some metrics but do not reliably eliminate these failures. Our results suggest that completion-based evaluation is insufficient for LLM-based multi-agent systems and that trace-grounded coordination metrics are needed to measure strategic behavior.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Paper Type: Standard paper
Submission Number: 31