Keywords: agent, benchmarking, evaluation
Abstract: Recent advances in large language models (LLMs) have produced capable web agents, and evaluating their action trajectories is critical for post-training data selection and feedback-driven improvement. However, this assessment space remains under-explored: existing benchmarks emphasize short, simple tasks and primarily evaluate trajectory correctness. As agent capabilities grow and attention shifts to realistic, complex scenarios, modern web agents routinely engage in long-horizon reasoning over dozens of turns, posing new challenges for evaluation. To better meet real-world evaluation needs, we present **CREAT**, a **C**omprehensive **RE**ward benchmark for lengthy and complex web **A**gent **T**rajectories. CREAT is not only a benchmark of challenging, high-order web browsing queries that demand long-horizon agentic reasoning, but also a comprehensive, fine-grained framework for assessing agent trajectories. It evaluates trajectories along five dimensions crucial for web agents, going beyond correctness alone. Experiments on 10 representative LLMs reveal weak sensitivity to hallucinations and a limited ability to separate necessary exploration from redundant actions, offering insights into whether current LLMs can serve as reliable judges for comprehensive agent trajectory evaluation.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 8957