Keywords: Multi-Agent Systems, Benchmarks, LLMs
TL;DR: We introduce a framework for constructing and evaluating multi-agent environments driven by large language models, enabling systematic benchmarking of agentic coordination and communication at scale.
Abstract: The rise of large language models (LLMs) has driven a surge in agents capable of making decisions from complex, unstructured instructions. However, their ability to coordinate with other agents while following such instructions remains an active area of research. To facilitate research in this area, we introduce the Coordinating LLM Agents Benchmark (CoLLAB), a framework for designing scalable environments that evaluate coordination in agentic LLM networks. CoLLAB adapts Distributed Constraint Optimization Problems (DCOPs), a widely used classical framework for cooperative multi-agent problem solving, and extends the formalism with unstructured instructions and communication, making it directly relevant for studying coordination among LLM agents. We provide a design blueprint for scaling CoLLAB environments across multiple dimensions. Finally, we implement three case-study environments within this framework and evaluate several LLM-based agent configurations. We quantitatively analyze LLM-generated solutions against classical symbolic solvers to directly assess their quality. In addition, we demonstrate how CoLLAB supports seamless scaling of environment complexity, allowing us to design increasingly challenging coordination tasks and assess how different agents adapt.
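For readers unfamiliar with the underlying formalism, the following is a minimal, illustrative sketch of a toy DCOP of the kind CoLLAB builds on. The variable names, domains, and cost functions are hypothetical and not taken from the submission; a centralized brute-force search stands in for the message-passing coordination that distributed agents would actually perform.

```python
# Illustrative only: a toy Distributed Constraint Optimization Problem (DCOP).
# Names, domains, and costs are hypothetical, not from the paper.
from itertools import product

# Each agent controls one variable with a small discrete domain.
variables = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}

# Pairwise constraints map joint assignments to costs (lower is better).
constraints = {
    ("a1", "a2"): lambda x, y: 0 if x != y else 2,
    ("a2", "a3"): lambda x, y: 0 if x == y else 1,
}

def total_cost(assignment):
    """Sum constraint costs over a complete assignment."""
    return sum(f(assignment[u], assignment[v]) for (u, v), f in constraints.items())

# Centralized brute-force baseline; in a true DCOP, agents would instead
# exchange messages to converge on a low-cost joint assignment.
best = min(
    (dict(zip(variables, values)) for values in product(*variables.values())),
    key=total_cost,
)
print(best, total_cost(best))
```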
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 123