Abstract: Existing web agents struggle with brittle planning, hallucinations caused by insufficient observations, and frequent execution failures due to unreliable element selection. We introduce WebEVA, a multimodal web agent that improves reliability across observation, decision, and execution through five key innovations: (1) generating high-level task requirements instead of low-level plans, (2) filtering elements via inner-text matching to reduce the number of candidate elements, (3) ranking icons and images via a fine-tuned ModernBERT model, (4) introducing an observation stage that analyzes the screenshot to assess task progress before deciding the next action, and (5) separating action selection (what to do next) from action parsing (how the action should be carried out), enabling clearer grounding prior to execution.
WebEVA sets a new state-of-the-art among open-source systems on the WebVoyager and Online-Mind2Web benchmarks. It achieves 82.1% in human evaluation and 90.8% in automated evaluation on WebVoyager, and 37.5% and 34.6% respectively on Online-Mind2Web.
We release WebEVA to support future research in web automation.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications, Language Modeling, Dialogue and Interactive Systems
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2720