Keywords: Agent evaluation, AI agents, AI evaluation, model evaluation, science of evaluations, LLM evaluation, evaluation methodology, language model, foundation model
TL;DR: We define and formalize the agentic evaluation paradigm and present a survey of open methodological challenges in agentic evaluation of AI systems.
Abstract: With the growing generality and advanced reasoning capabilities of AI systems, an increasing number of AI evaluations are _agentic evaluations_: evaluations involving complex tasks that require interaction with an environment, as opposed to knowledge-based question-answering benchmarks. However, no work has explored the methodological challenges of agentic evaluations or the practices necessary to ensure their validity, reliability, replicability, and efficiency. In this (work-in-progress) paper, we (1) define and formalize the agentic evaluation paradigm; (2) survey and analyze methodological problems in agentic evaluations; and (3) discuss the implications of agentic evaluations for AI governance. Our hope is to improve the state of agentic evaluations of AI systems, systematize the methodological work in this domain, and contribute to the establishment of a science of AI evaluations.
Submission Number: 1