Towards Design of an Automated Judge for Multi-Agent Systems

AAAI 2026 Workshop TrustAgent Submission66 Authors

Published: 20 Nov 2025, Last Modified: 09 Mar 2026

Venue: AAAI 2026 TrustAgent Workshop Poster

License: CC BY 4.0

Keywords: llm, multi-agent systems, llm-as-judge

TL;DR: An unsupervised framework of specialized LLM metric agents that accurately evaluates multi-agent systems and outperforms existing tools.

Abstract: Evaluating multi-agent systems (MAS) requires nuanced analysis across different levels of interaction, from individual agent behaviors to emergent collective dynamics. Existing methods that rely on human annotation or a single LLM evaluator struggle to capture this complexity. We introduce a novel unsupervised framework that addresses this gap through a collective of specialized LLM-based evaluators. Our architecture employs a two-tier evaluation approach. At the agent level, specialized evaluators assess individual competencies, including observation alignment, tool selection, and state consistency. At the system level, dedicated evaluators analyze collective performance across task completion, role distribution, and complexity metrics. Each evaluator receives structured trace representations with explicit task context. The proposed AutoPumpkin (Automated Performance Understanding for Multi-agents Planning, Knowledge Integration and Negotiation) judge aggregates key agent- and system-level metrics alongside evaluator justifications to determine overall MAS success (binary classification) without requiring labeled training data. On the out-of-distribution TRAIL benchmark, AutoPumpkin achieves an F1 of 0.85, substantially outperforming the TRAIL baseline (F1 0.71). On in-distribution data, AutoPumpkin maintains competitive performance (F1 0.65) compared to TRAIL (F1 0.634), demonstrating strong cross-dataset generalization. This hierarchical evaluation framework enables precise failure identification and targeted optimization across both agent and system levels while remaining scalable and interpretable.
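To make the two-tier structure concrete, the following is a minimal Python sketch of how agent-level and system-level metric results might be combined into a binary verdict and scored with F1. It is an illustration under stated assumptions, not the authors' implementation: the names `MetricResult`, `judge`, and `f1_score` are hypothetical, evaluator scores are assumed to lie in [0, 1], and a simple mean-and-threshold rule stands in for the LLM-based judge described in the abstract.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricResult:
    """One evaluator's output (hypothetical schema, not from the paper)."""
    level: str          # "agent" or "system"
    metric: str         # e.g. "tool_selection", "task_completion"
    score: float        # assumed to lie in [0, 1]
    justification: str  # the evaluator's free-text reasoning


def judge(results, threshold=0.5):
    """Aggregate metric results into a binary success verdict.

    Assumption: a mean-and-threshold rule replaces the LLM-based
    aggregation that the abstract describes.
    """
    agent_scores = [r.score for r in results if r.level == "agent"]
    system_scores = [r.score for r in results if r.level == "system"]
    if not agent_scores or not system_scores:
        raise ValueError("need at least one result per tier")
    agent_mean = mean(agent_scores)
    system_mean = mean(system_scores)
    # Both tiers must pass, so a strong system-level score cannot hide
    # a failing agent-level tier, and vice versa.
    success = agent_mean >= threshold and system_mean >= threshold
    # Surface the lowest-scoring metric to support failure identification.
    weakest = min(results, key=lambda r: r.score)
    return {
        "success": success,
        "agent_mean": agent_mean,
        "system_mean": system_mean,
        "weakest_metric": weakest.metric,
        "weakest_justification": weakest.justification,
    }


def f1_score(predicted, actual):
    """F1 for binary verdicts, with True as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, requiring both tiers to clear the threshold is a design choice made for illustration; the verdict also reports the weakest metric and its justification, mirroring the abstract's emphasis on interpretable failure identification.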

Submission Number: 66