Agent-as-a-Judge: Evaluate Agents with Agents

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces the Agent-as-a-Judge framework, together with a new dataset collected to help verify the idea.
Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes---ignoring the step-by-step nature of the thinking done by agentic systems---or require excessive manual labour. To address this, we introduce the **Agent-as-a-Judge** framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present **DevAI**, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, including a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems. To support this, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge.
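As a rough illustration of the evaluation loop the abstract describes, the sketch below judges a developer agent's workspace against hierarchical requirements. The requirement schema, the `query_llm` helper, and the prompt wording are hypothetical assumptions for illustration only; they are not the released Agent-as-a-Judge implementation (see the repository linked below for that).

```python
# Minimal sketch of an "agent judges agent" loop over hierarchical requirements.
# The schema, prompts, and `query_llm` placeholder are illustrative assumptions,
# not the released Agent-as-a-Judge implementation.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Requirement:
    rid: str                                              # e.g. "R1"
    text: str                                             # natural-language criterion
    depends_on: list[str] = field(default_factory=list)   # prerequisite requirement ids

def gather_evidence(workspace: Path, max_chars: int = 4000) -> str:
    """Collect source files produced by the developer agent as judging context."""
    chunks = []
    for path in sorted(workspace.rglob("*.py")):
        chunks.append(f"# --- {path.name} ---\n{path.read_text()[:1000]}")
    return "\n".join(chunks)[:max_chars]

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; plug in any model client here."""
    raise NotImplementedError("connect your preferred LLM client")

def judge(workspace: Path, requirements: list[Requirement]) -> dict[str, bool]:
    """Judge each requirement, skipping those whose prerequisites already failed."""
    verdicts: dict[str, bool] = {}
    evidence = gather_evidence(workspace)
    for req in requirements:  # assumed to be listed in dependency order
        if any(not verdicts.get(dep, False) for dep in req.depends_on):
            verdicts[req.rid] = False   # unmet prerequisite: fail without an LLM call
            continue
        answer = query_llm(
            f"Project files:\n{evidence}\n\n"
            f"Requirement {req.rid}: {req.text}\n"
            "Is this requirement satisfied? Answer YES or NO."
        )
        verdicts[req.rid] = answer.strip().upper().startswith("YES")
    return verdicts
```

The dependency check reflects the hierarchical structure of the requirements: a requirement is only sent to the judge model once its prerequisites have passed, which is one way an agentic evaluator can give intermediate, per-requirement feedback rather than a single end-to-end verdict.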
Lay Summary: Existing evaluation methods either focus solely on final outcomes—overlooking the agent’s reasoning—or depend on costly human review, making them impractical for large-scale systems. We introduce the “Agent-as-a-Judge” framework, in which one autonomous agent observes and provides step-by-step, fine-grained assessments of another agent’s task execution. To validate our approach, we created DevAI, a new benchmark comprising 55 real-world AI development tasks and 365 hierarchical requirements. In code-generation experiments, Agent-as-a-Judge agreed with human expert evaluations about 90% of the time—substantially outperforming the 70% agreement rate of previous LLM-as-a-Judge methods. Moreover, our framework cuts evaluation time and cost by roughly 97%, reducing effort from 86 hours and 1,297 USD to approximately 2 hours and 31 USD. By delivering human-level reliability at a fraction of the cost, Agent-as-a-Judge paves the way for rapid, trustworthy analysis of complex multi-agent systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/metauto-ai/agent-as-a-judge
Primary Area: Applications
Keywords: Code Generation, Agent-as-a-Judge, AI Developer, AI Judge, LLM
Submission Number: 1636