Keywords: Tool-augmented Agent, LLM Agent, Evaluation, Benchmark
Abstract: Driven by recent advancements in tool-augmented Large Language Model (LLM) agents, comprehensive benchmark datasets for evaluating such agents are being actively developed. Although these benchmarks incorporate increasingly complex user requests and a diverse array of tools, most of them still evaluate agents only by matching final answers. However, as the number of steps required to resolve a user request grows, a proper evaluation of an agent's performance must go beyond the final answer and also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucinations, and adaptivity. The most straightforward way to evaluate these aspects is to compare the agent's trajectory against a ground-truth trajectory, but this approach is fundamentally limited because annotating all possible ground-truth trajectories is prohibitively expensive. To address these gaps, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence store, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory without requiring a predefined ground-truth trajectory. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse, flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 11190