Keywords: agent, benchmarking, evaluation
Abstract: Recent advances in large language models (LLMs) have produced capable web agents, and evaluating their action trajectories is critical for post-training data selection and feedback-driven improvement. However, this assessment space remains under-explored: existing benchmarks emphasize short, simple tasks and primarily evaluate trajectory correctness. As agent capabilities grow and attention shifts to realistic, complex scenarios, modern web agents routinely engage in long-horizon reasoning over dozens of turns, posing new challenges for evaluation. To better meet real-world evaluation needs, we present **CREAT**, a **C**omprehensive **RE**ward benchmark for lengthy and complex web **A**gent **T**rajectories. CREAT is not only a benchmark of challenging, high-order web browsing queries that demand long-horizon agentic reasoning, but also a comprehensive, fine-grained framework for assessing agent trajectories. It evaluates trajectories along five dimensions crucial for web agents, going beyond correctness alone. Experiments on 10 representative LLMs reveal weak sensitivity to hallucinations and a limited ability to separate necessary exploration from redundant actions, offering insights into whether current LLMs can serve as reliable judges for comprehensive agent trajectory evaluation.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 8957