Keywords: Agent, Evaluation, LLM
TL;DR: We propose a new method that induces metrics for evaluating and improving agents from open-ended human feedback.
Abstract: Agents are predominantly evaluated and optimized via task success metrics, which are coarse,
rely on manual design from experts, and fail to reward intermediate emergent behaviors.
We propose AutoLibra, a framework for agent evaluation that transforms open-ended
human feedback, e.g., “If you find that the button is disabled, don’t click it again” or “This
agent has too much autonomy to decide what to do on its own”, into metrics for evaluating
fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding
feedback to an agent’s behavior, clustering similar positive and negative behaviors, and
creating metrics with clear definitions and concrete examples, which can be used to
prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate
the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”.
Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability
to induce more concrete agent evaluation metrics than the ones proposed in previous
agent evaluation benchmarks and discover new metrics to analyze agents. We also present
two applications of AutoLibra in agent improvement: First, we show that AutoLibra's metrics
help human prompt engineers diagnose agent failures and improve prompts iteratively.
Moreover, we find that AutoLibra can induce metrics for automatic agent optimization,
enabling agents to improve through self-regulation. Our results suggest that AutoLibra is a
powerful task-agnostic tool for evaluating and improving language agents.
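To make the two meta-metrics concrete, here is a minimal sketch of set-based versions of “coverage” and “redundancy”. This is illustrative only: the metric names, the mapping from induced metrics to matched feedback items, and the exact formulas are assumptions, not the paper's definitions.

```python
# Illustrative sketch of AutoLibra-style meta-metrics over induced metrics.
# Each induced metric is assumed to "match" (cover) a set of feedback items.

def coverage(feedback_ids, metric_to_feedback):
    """Fraction of feedback items matched by at least one induced metric."""
    covered = set().union(*metric_to_feedback.values()) if metric_to_feedback else set()
    return len(covered & set(feedback_ids)) / len(feedback_ids)

def redundancy(feedback_ids, metric_to_feedback):
    """Average number of *extra* metrics matching each covered feedback item."""
    counts = [sum(fid in s for s in metric_to_feedback.values())
              for fid in feedback_ids]
    extra = [c - 1 for c in counts if c > 0]
    return sum(extra) / len(extra) if extra else 0.0

# Hypothetical induced metrics and feedback items:
metrics = {
    "avoid-repeated-clicks": {"f1", "f2"},
    "ask-before-acting":     {"f2", "f3"},
}
feedback = ["f1", "f2", "f3", "f4"]
print(coverage(feedback, metrics))    # 3 of 4 items covered -> 0.75
print(redundancy(feedback, metrics))  # only f2 is matched twice -> 1/3
```

Under these toy definitions, optimizing the induced metric set means pushing coverage toward 1 (every piece of feedback is captured by some metric) while keeping redundancy near 0 (metrics do not overlap).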
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22008