Keywords: Tool-augmented Agent, LLM Agent, Evaluation, Benchmark
Abstract: Driven by recent advancements in tool-augmented Large Language Model (LLM) agents, comprehensive benchmark datasets for evaluating such agents are being actively developed. Although these benchmarks incorporate increasingly complex user requests and a diverse array of tools, most of them still evaluate agents only by matching final answers. However, as the number of steps required to resolve a user request grows, a proper evaluation of an agent's performance must go beyond the final answer and also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucinations, and adaptivity. The most straightforward way to evaluate these aspects is to compare the agent's trajectory against a ground-truth trajectory, but this approach is fundamentally limited because annotating all possible ground-truth trajectories is prohibitively expensive. To address these gaps, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence store, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory without requiring a predefined ground-truth trajectory. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse, flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 11190