Keywords: web agent, agent evaluation, webarena, benchmark
Abstract: Autonomous web agents have attracted growing interest in the research community and increasingly operate in complex, multi-step browser workflows. However, widely used benchmarks can misestimate performance due to underspecified goals and fragile evaluators, challenges typical of normal benchmark maturation rather than flaws in the paradigm.
These issues hinder accurate performance assessment and limit the benchmark's utility for guiding method development.
We present WebArena Verified, a reproducible re‑evaluation of WebArena that retains its containerized environments while strengthening measurement.
We audit all 812 tasks, repair misaligned checks, clarify ambiguous instructions, replace substring matching with type-aware exact matching combined with semantic normalization, verify backend state for mutation tasks, and adopt a structured JSON schema with explicit status codes for deterministic scoring.
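To illustrate the kind of check this implies, the sketch below shows a type-aware exact match with semantic normalization and a structured answer carrying an explicit status code; the field names, normalization rules, and helper functions are illustrative assumptions, not the released evaluator.

```python
# Minimal sketch (assumed field names and rules, not the released evaluator):
# type-aware exact matching after normalizing both sides to a canonical value.
from datetime import datetime
from decimal import Decimal, InvalidOperation


def normalize(value: str):
    """Map a raw string onto a canonical typed value (number, date, or text)."""
    text = value.strip()
    try:
        return Decimal(text.replace(",", ""))           # "1,299.00" -> Decimal("1299.00")
    except InvalidOperation:
        pass
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(text, fmt).date()   # "March 5, 2023" -> date(2023, 3, 5)
        except ValueError:
            continue
    return text.casefold()                               # case-insensitive text fallback


def exact_match(predicted: str, reference: str) -> bool:
    """Exact match once both sides share the same normalized type."""
    return normalize(predicted) == normalize(reference)


# Hypothetical structured answer with an explicit status code for deterministic scoring.
answer = {"status": "SUCCESS", "answer": "1,299.00"}
assert answer["status"] == "SUCCESS" and exact_match(answer["answer"], "1299")
```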
For reporting, we provide per‑template macro‑averaged metrics, 95\% confidence intervals, and failure‑mode breakdowns.
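As a concrete reading of per-template macro-averaging, each template's success rate is computed first and templates are then averaged with equal weight, so heavily instantiated templates do not dominate; a minimal sketch with illustrative data (not paper results) follows.

```python
# Minimal sketch of per-template macro-averaging (illustrative data only).
from collections import defaultdict
from statistics import mean

# Each record: (template_id, task_passed)
results = [("t1", True), ("t1", False), ("t1", True), ("t2", False), ("t2", True)]

per_template = defaultdict(list)
for template_id, passed in results:
    per_template[template_id].append(passed)

# Success rate within each template, then an unweighted mean across templates.
template_rates = {tid: mean(map(float, runs)) for tid, runs in per_template.items()}
macro_average = mean(template_rates.values())  # (2/3 + 1/2) / 2 ~= 0.583
print(template_rates, macro_average)
```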
Our new evaluator reduces the false‑negative rate by 11.3 percentage points compared to the original scoring pipeline.
To reduce evaluation overhead, we introduce WebArena Verified Hard, a 258‑task subset that retains difficult tasks, reduces runtime by 68.2\%, and maintains discriminative power and coverage.
WebArena Verified remains drop‑in compatible with existing agents, requiring only minimal changes and enabling faithful comparisons. Code, data, and evaluation tools will be released to support reproducibility.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 16394