Keywords: Agent evaluation, AI agents, AI evaluation, model evaluation, science of evaluations, LLM evaluation, evaluation methodology, language model, foundation model
TL;DR: We define and formalize the agentic evaluation paradigm and present a survey of open methodological challenges in agentic evaluation of AI systems.
Abstract: With the growing generality and advanced reasoning capabilities of AI systems, an increasing number of AI evaluations are _agentic evaluations_: evaluations involving complex tasks that require interaction with an environment, as opposed to knowledge-based question-answering benchmarks. However, no work has explored the methodological challenges of agentic evaluations or the practices necessary to ensure their validity, reliability, replicability, and efficiency. In this (work-in-progress) paper, we (1) define and formalize the agentic evaluation paradigm; (2) survey and analyze methodological problems in agentic evaluations; and (3) discuss the implications of agentic evaluations for AI governance. Our hope is to improve the state of agentic evaluations of AI systems, systematize the methodological work in this domain, and contribute to the establishment of a science of AI evaluations.
Submission Number: 1