Keywords: Web agent, Evaluation benchmark, Multimodal Large Language Model
Abstract: Recent advances in foundation models have enabled autonomous web agents to navigate and interact with real-world websites. However, existing benchmarks primarily focus on general-purpose web navigation, offering limited coverage of information-centric environments. Evaluations that depend on live websites further hinder reproducibility due to constantly changing content.
The arXiv platform provides a natural balance between realism and reproducibility, featuring hierarchically structured and information-centric webpages without privacy-sensitive interactions.
Building on this foundation, we present WebArxiv, a benchmark for reproducible evaluation of multimodal web agents within the arXiv environment. WebArxiv is built from static webpage snapshots and includes 510 time-invariant tasks, each with a unique and deterministic ground truth.
We evaluate a range of foundation-model-based web agents on this benchmark, finding that WebArxiv poses significant challenges for current agents.
Behavioral analyses reveal a common failure mode in which agents over-rely on fixed interaction histories, leading to incomplete or repetitive reasoning. To address this limitation, we equip the agents with a lightweight dynamic memory mechanism that enables adaptive retrieval and reasoning over relevant context, thereby enhancing their overall navigation performance.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Web agent, Evaluation benchmark, Multimodal Large Language Model
Languages Studied: English
Submission Number: 5086