Keywords: LLM Agent, LLM Judge, Evaluation, Eval
TL;DR: The Agent GPA evaluation framework mirrors an agent's operational loop of goals, plans, and actions; it captures a broad range of agent failures, with LLM judges agreeing with human annotators on error detection and localization 80% to 95% of the time.
Abstract: We introduce the Agent GPA (Goal-Plan-Action) framework: an evaluation paradigm based on an agent's operational loop of setting goals, devising plans, and executing actions. The framework includes five evaluation metrics: Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence. Logical Consistency checks that an agent's actions are consistent with its prior actions; Execution Efficiency checks whether the agent executes in the most efficient way to achieve its goal; Plan Quality checks whether an agent's plans are aligned with its goals; Plan Adherence checks whether an agent's actions are aligned with its plan; and Goal Fulfillment checks that an agent's final outcomes match the stated goals. Our experimental results on two benchmark datasets – the public GAIA dataset and an internal dataset for a production-grade data agent – show that this framework (a) provides a systematic way to cover a broad range of agent failures, including all agent errors on the GAIA benchmark dataset; (b) exhibits strong agreement between human and LLM judges, ranging from 80% to over 95%; and (c) localizes errors with 86% agreement with human annotations, enabling targeted improvement of agent performance.
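To make the five metrics concrete, the following is a minimal Python sketch of how per-metric LLM-judge evaluation over an agent trace might be wired up. It is illustrative only: the paper does not specify prompts or interfaces, so the AgentTrace fields, the rubric wording, the PASS/FAIL protocol, and the caller-supplied judge function are all assumptions, not the authors' implementation.

    from dataclasses import dataclass
    from typing import Callable

    # Hypothetical trace structure; field names are illustrative, not from the paper.
    @dataclass
    class AgentTrace:
        goal: str
        plan: str
        actions: list[str]
        outcome: str

    # One rubric question per GPA metric, paraphrasing the abstract's definitions.
    RUBRICS = {
        "Goal Fulfillment": "Does the final outcome match the stated goal?",
        "Logical Consistency": "Is each action consistent with the agent's prior actions?",
        "Execution Efficiency": "Does the agent execute in the most efficient way to achieve its goal?",
        "Plan Quality": "Is the plan aligned with the goal?",
        "Plan Adherence": "Are the actions aligned with the plan?",
    }

    def evaluate(trace: AgentTrace, judge: Callable[[str], str]) -> dict[str, bool]:
        """Ask an LLM judge each rubric question; expect a PASS/FAIL verdict."""
        results = {}
        for metric, question in RUBRICS.items():
            prompt = (
                f"Goal: {trace.goal}\nPlan: {trace.plan}\n"
                f"Actions: {'; '.join(trace.actions)}\nOutcome: {trace.outcome}\n\n"
                f"{question} Answer PASS or FAIL."
            )
            results[metric] = judge(prompt).strip().upper().startswith("PASS")
        return results

In practice, judge would wrap a real LLM API call; a stub such as lambda prompt: "PASS" is enough to exercise the plumbing. Per-metric verdicts are what make error localization direct in a scheme like this: a FAIL on Plan Adherence points at the action layer, while a FAIL on Plan Quality points at the plan itself.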
Primary Area: datasets and benchmarks
Submission Number: 21307