Keywords: multi-agent systems, large language models, llm, compound ai systems, agents, ai
TL;DR: A dataset of multi-agent system traces, and a systematic analysis of failures in multi-agent LLM systems, featuring a structured taxonomy and an automated evaluation pipeline.
Abstract: Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail, which in turn requires systematic identification and analysis of failure patterns. We introduce MAD, the first Multi-Agent System Failure Dataset: 1000+ annotated traces collected across 7 popular MAS frameworks, built to expose failure dynamics in MAS and to guide the development of better future systems. To enable systematic classification of failures in MAD, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (κ = 0.88). This process identifies 14 unique failure modes, clustered into 3 categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAD to analyze failure patterns across models (GPT-4, Claude 3) and tasks (coding, math, general agentic tasks), demonstrating the headroom available from better MAS design. Our analysis reveals that the identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release the dataset (MAD), the taxonomy (MAST), and our LLM annotator to facilitate widespread research and development in MAS.
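The abstract reports inter-annotator agreement as κ = 0.88, which is standardly Cohen's kappa: observed agreement corrected for the agreement two annotators would reach by chance. A minimal sketch of that computation (the function name and example labels are illustrative, not from the paper's released code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label if
    # each annotator labeled independently at their marginal rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b.get(c, 0) for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations using the taxonomy's three top-level categories.
a = ["spec", "misalign", "verify", "spec"]
b = ["spec", "misalign", "spec", "spec"]
print(cohens_kappa(a, b))
```

A κ of 0.88 on this scale is conventionally read as near-perfect agreement, supporting the claim that MAST's failure modes are well defined enough for independent annotators to apply consistently.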
Submission Number: 45