WebArXiv: A Reproducible Benchmark for Evaluating Multimodal Web Agents on arXiv Tasks

ACL ARR 2026 January Submission 5086 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Web agent, Evaluation benchmark, Multimodal Large Language Model
Abstract: Recent advances in foundation models have enabled autonomous web agents to navigate and interact with real-world websites. However, existing benchmarks focus primarily on general-purpose web navigation and offer limited coverage of information-centric environments. Evaluations that depend on live websites further hinder reproducibility, since page content changes constantly. The arXiv platform offers a natural balance between realism and reproducibility: its webpages are hierarchically structured and information-centric, and they involve no privacy-sensitive interactions. Building on this foundation, we present WebArXiv, a benchmark for reproducible evaluation of multimodal web agents in the arXiv environment. WebArXiv is built from static webpage snapshots and comprises 510 time-invariant tasks, each with a unique, deterministic ground truth. We evaluate a range of foundation-model-based web agents and find that WebArXiv poses significant challenges for current systems. Behavioral analysis identifies a common failure mode in which agents over-rely on fixed interaction histories, leading to incomplete or repetitive reasoning. To address this limitation, we equip the agents with a lightweight dynamic memory mechanism that adaptively retrieves and reasons over relevant context, improving their overall navigation performance.
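The abstract describes the dynamic memory mechanism only at a high level. The sketch below illustrates one plausible reading: instead of replaying a fixed window of recent steps, the agent retrieves past steps by relevance to the current observation before each model call. The class and method names (DynamicMemory, Step, retrieve) and the token-overlap similarity are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a dynamic memory mechanism for a web agent.
# All names here are hypothetical; the paper's mechanism may differ.
# Similarity is simple token overlap (Jaccard) to keep the example
# dependency-free; a real agent would likely use learned embeddings.
from dataclasses import dataclass


@dataclass
class Step:
    observation: str  # e.g., a page-text or accessibility-tree snapshot
    action: str       # e.g., "click [search]" or "type [query] 'diffusion'"


class DynamicMemory:
    def __init__(self) -> None:
        self.steps: list[Step] = []

    def add(self, observation: str, action: str) -> None:
        self.steps.append(Step(observation, action))

    def retrieve(self, query: str, k: int = 3) -> list[Step]:
        """Return the k past steps most relevant to the current state,
        rather than a fixed window of the most recent steps."""
        q = set(query.lower().split())

        def score(step: Step) -> float:
            s = set(step.observation.lower().split())
            return len(q & s) / max(len(q | s), 1)

        return sorted(self.steps, key=score, reverse=True)[:k]


# Usage: condition the agent's prompt on retrieved steps instead of
# the raw trailing history.
memory = DynamicMemory()
memory.add("arXiv search results for 'web agents'", "click [result 2]")
memory.add("abstract page of an arXiv paper", "scroll down")
relevant = memory.retrieve("abstract page listing authors and subjects")
context = "\n".join(f"{s.action} <- {s.observation}" for s in relevant)
```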
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Web agent, Evaluation benchmark, Multimodal Large Language Model
Languages Studied: English
Submission Number: 5086