Abstract: Existing web agents struggle with brittle planning, hallucinations caused by insufficient observations, and frequent execution failures due to unreliable element selection. We introduce WebEVA, a multimodal web agent that improves reliability across observation, decision, and execution through five key innovations: (1) generating high-level task requirements instead of low-level plans, (2) filtering elements via inner-text matching to reduce the number of candidate elements, (3) ranking icons and images via a fine-tuned ModernBERT model, (4) introducing an observation stage that analyzes the screenshot to assess task progress before deciding the next action, and (5) separating action selection (what to do next) from action parsing (how the action should be carried out), enabling clearer grounding prior to execution.
WebEVA sets a new state-of-the-art among open-source systems on the WebVoyager and Online-Mind2Web benchmarks. It achieves 82.1% in human evaluation and 90.8% in automated evaluation on WebVoyager, and 37.5% and 34.6% respectively on Online-Mind2Web.
We release WebEVA to support future research in web automation.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications, Language Modeling, Dialogue and Interactive Systems
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2720