Keywords: Multi-Agent Systems, Collaborative Coding, Agent Coordination, Benchmarking, Large Language Models
TL;DR: We introduce CooperBench, a benchmark for collaborative coding, and show that state-of-the-art coding agents perform substantially worse when required to coordinate, revealing coordination rather than coding skill as the primary bottleneck.
Abstract: Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus.
Similarly, as AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities.
To test this hypothesis, we introduce CooperBench, a benchmark of 600 collaborative coding tasks spanning 12 libraries and 4 languages. Each task assigns two agents different features that can be implemented independently but may conflict without coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the *curse of coordination*: a 30% average drop in success when agents work together versus individually, across all task difficulties. This contrasts with human teams, where adding teammates typically improves productivity. We identify three failure modes: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans, observations, and communication.
In large-scale simulations, we observe rare emergent coordination behaviors such as role division, resource division, and negotiation.
Beyond introducing a new collaborative coding benchmark, our work calls for a research shift from pursuing individual agent capability to *social intelligence*: understanding others, communicating effectively, and coordinating actions.
Submission Number: 27