Agent-as-a-Judge: Evaluate Agents with Agents

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces the Agent-as-a-Judge framework, together with a new dataset collected to help verify the idea.
Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes---ignoring the step-by-step nature of the thinking done by agentic systems---or require excessive manual labour. To address this, we introduce the **Agent-as-a-Judge** framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present **DevAI**, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, including a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems. To support this, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge.
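As a rough illustration of the evaluation loop the abstract describes, the sketch below judges a developer agent's workspace against hierarchical requirements. The requirement schema, the `query_llm` helper, and the prompt wording are hypothetical assumptions for illustration only; they are not the released Agent-as-a-Judge implementation (see the repository linked below for that).

```python
# Minimal sketch of an "agent judges agent" loop over hierarchical requirements.
# The schema, prompts, and `query_llm` placeholder are illustrative assumptions,
# not the released Agent-as-a-Judge implementation.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Requirement:
    rid: str                                              # e.g. "R1"
    text: str                                             # natural-language criterion
    depends_on: list[str] = field(default_factory=list)   # prerequisite requirement ids

def gather_evidence(workspace: Path, max_chars: int = 4000) -> str:
    """Collect source files produced by the developer agent as judging context."""
    chunks = []
    for path in sorted(workspace.rglob("*.py")):
        chunks.append(f"# --- {path.name} ---\n{path.read_text()[:1000]}")
    return "\n".join(chunks)[:max_chars]

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; plug in any model client here."""
    raise NotImplementedError("connect your preferred LLM client")

def judge(workspace: Path, requirements: list[Requirement]) -> dict[str, bool]:
    """Judge each requirement, skipping those whose prerequisites already failed."""
    verdicts: dict[str, bool] = {}
    evidence = gather_evidence(workspace)
    for req in requirements:  # assumed to be listed in dependency order
        if any(not verdicts.get(dep, False) for dep in req.depends_on):
            verdicts[req.rid] = False   # unmet prerequisite: fail without an LLM call
            continue
        answer = query_llm(
            f"Project files:\n{evidence}\n\n"
            f"Requirement {req.rid}: {req.text}\n"
            "Is this requirement satisfied? Answer YES or NO."
        )
        verdicts[req.rid] = answer.strip().upper().startswith("YES")
    return verdicts
```

The dependency check reflects the hierarchical structure of the requirements: a requirement is only sent to the judge model once its prerequisites have passed, which is one way an agentic evaluator can give intermediate, per-requirement feedback rather than a single end-to-end verdict.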
Lay Summary: Existing evaluation methods either focus solely on final outcomes—overlooking the agent’s reasoning—or depend on costly human review, making them impractical for large-scale systems. We introduce the “Agent-as-a-Judge” framework, in which one autonomous agent observes and provides step-by-step, fine-grained assessments of another agent’s task execution. To validate our approach, we created DevAI, a new benchmark comprising 55 real-world AI development tasks and 365 hierarchical requirements. In code-generation experiments, Agent-as-a-Judge agreed with human expert evaluations about 90% of the time—substantially outperforming the 70% agreement rate of previous LLM-as-a-Judge methods. Moreover, our framework cuts evaluation time and cost by roughly 97%, reducing effort from 86 hours and 1,297 USD to approximately 2 hours and 31 USD. By delivering human-level reliability at a fraction of the cost, Agent-as-a-Judge paves the way for rapid, trustworthy analysis of complex multi-agent systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/metauto-ai/agent-as-a-judge
Primary Area: Applications
Keywords: Code Generation, Agent-as-a-Judge, AI Developer, AI Judge, LLM
Submission Number: 1636