The Influence of Scaffolds on Coordination Scaling Laws in LLM Agents

Published: 28 Sept 2025, Last Modified: 21 Oct 2025
Venue: SEA @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: multi-agent systems, coordination, large language models, agents, scaffolds, benchmark
Abstract: As large language models improve in capability, they are increasingly taking on more agentic and interactive roles in multi-agent settings that demand effective communication and coordination. To measure a model's capabilities in these settings, new benchmarks are quickly emerging to study language-based, multi-agent interaction, often by adding language scaffolds on top of existing multi-agent environments. However, when evaluating agents on such benchmarks, an agent's performance can be significantly influenced by implicit factors related to the design of the scaffolds, rather than the inherent properties of the agents. Moreover, it is unclear whether coordination among agents in these settings follows scaling laws. We consider one such environment, the popular collaborative cooking environment Collab-Overcooked, and characterize how scaffolding plays a role in successful collaborations between models of varying sizes. We perform empirical evaluations on the collaborative capabilities of agents and find that, as long as models are given clear instructions on how to collaborate, their capabilities follow positive scaling laws in both self-play and cross-play. However, without a scaffold that explicitly defines how the collaboration should be done, we find that models struggle to develop effective methods for collaboration, and scaling laws break down. Our experiments highlight how subtle changes in agent scaffolds can drastically impact their collaborative capabilities and raise questions about how to design evaluations for agents that may have to collaborate with open-ended partners.
Archival Option: The authors of this submission want it to appear in the archival proceedings.
Submission Number: 105