Abstract: Multi-Agent Systems (MAS) built from Large Language Models (LLMs) offer significant potential for complex problem-solving, yet configuring them well is challenging, with performance typically evaluable only after resource-intensive execution. Addressing the underexplored area of MAS performance predictability, this paper investigates whether and how accurately MAS outcomes can be forecast. We propose and evaluate a methodology that monitors MAS operations during execution, captures agent inputs and outputs, and transforms this data into system-specific statistical indicators. These indicators are then used to train a regression model to predict overall task performance. Conducting experiments across five distinct MAS architectures and three benchmark tasks, we demonstrate that MAS performance is predictable to a substantial degree, with Spearman rank correlations between predicted and actual scores typically ranging from $\textbf{0.76}$ to $\textbf{0.94}$. Notably, our findings indicate that the global statistics required for these predictions can be accurately estimated from as little as 10\% of the total operational data-generating events, still yielding a high correlation of $\textbf{0.82}$. Further analysis reveals that metrics quantifying individual agent capabilities are the most influential factors in performance prediction. This work underscores the feasibility of reliably predicting MAS performance, offering a path towards more efficient design, configuration, and deployment of MASs.
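To make the abstract's pipeline concrete, the sketch below illustrates the three stages it describes: per-run statistical indicators derived from logged agent inputs and outputs, a regression model fit to predict task performance, and evaluation via Spearman rank correlation. This is a minimal illustration, not the authors' code: the synthetic data, the feature layout (one row of system-level statistics per MAS run), and the choice of gradient-boosted regression are all assumptions, as the paper specifies only that "a regression model" is trained on such indicators.

```python
# Minimal sketch of the predict-MAS-performance pipeline described in the
# abstract. All data here is synthetic and the model choice is an assumption.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumed log-derived indicators: one row per MAS run, e.g. mean agent
# output length, inter-agent agreement rate, message count, etc.
n_runs, n_features = 200, 8
X = rng.normal(size=(n_runs, n_features))  # stand-in statistical indicators
# Stand-in task scores with a noisy linear dependence on the indicators.
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.5, size=n_runs)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a regression model on indicators from completed runs.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, y_tr)

# Evaluate as in the paper: rank correlation between predicted and actual
# scores on held-out runs.
rho, _ = spearmanr(model.predict(X_te), y_te)
print(f"Spearman rank correlation: {rho:.2f}")
```

The paper's subsampling result would correspond to computing the indicator rows from only a fraction (e.g. 10%) of each run's logged events before fitting; the evaluation step is unchanged.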
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling, Generation, Machine Learning for NLP
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 5542