Detecting Pipeline Failures through Fine-Grained Analysis of Web Agents

ACL ARR 2025 July Submission 671 Authors

28 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Web agents powered by large language models (LLMs) can autonomously perform complex, multi-step tasks in dynamic web environments. However, current evaluations mostly focus on overall task success while overlooking intermediate errors, which limits insight into failure modes and hinders systematic improvement. This work analyzes existing benchmarks and highlights the lack of fine-grained diagnostic tools. To address this gap, we propose a modular evaluation framework that decomposes agent pipelines into interpretable stages for detailed error analysis. Using the SeeAct framework and the Mind2Web dataset as a case study, we show how this approach reveals actionable weaknesses missed by standard metrics, paving the way for more robust and generalizable web agents.
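To make the idea of stage-wise diagnosis concrete, the following is a minimal illustrative sketch of how per-step errors in a SeeAct-style pipeline might be attributed to the first stage that diverges from the ground-truth action. It is not the paper's actual implementation; the stage names, record fields, and comparison logic are assumptions for illustration only.

```python
# Illustrative sketch (not the submission's code): attribute each step's failure to the
# earliest stage of a hypothetical decomposed pipeline that disagrees with ground truth.
from dataclasses import dataclass
from collections import Counter


@dataclass
class StepRecord:
    # Per-step predictions vs. gold annotations; field names are assumptions.
    predicted_element: str
    gold_element: str
    predicted_operation: str   # e.g. CLICK, TYPE, SELECT
    gold_operation: str
    predicted_value: str       # typed text or selected option, "" if none
    gold_value: str


def first_failed_stage(step: StepRecord) -> str:
    """Return the earliest pipeline stage whose output disagrees with the gold action."""
    if step.predicted_element != step.gold_element:
        return "element_grounding"
    if step.predicted_operation != step.gold_operation:
        return "operation_selection"
    if step.predicted_value != step.gold_value:
        return "value_generation"
    return "success"


if __name__ == "__main__":
    steps = [
        StepRecord("button#search", "button#search", "CLICK", "CLICK", "", ""),
        StepRecord("input#from", "input#to", "TYPE", "TYPE", "NYC", "NYC"),
        StepRecord("input#date", "input#date", "TYPE", "TYPE", "07/28", "08/20"),
    ]
    # Aggregating the first failed stage over many steps yields the kind of
    # per-stage error profile that overall success rates hide.
    print(Counter(first_failed_stage(s) for s in steps))
```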
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, vision language navigation, multimodal applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=GBDIAAQ4Pq
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Software: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3 and Section 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: No
B5 Elaboration: Dataset statistics, including details of domains, are described in the cited work
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix E (Model Specifications)
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: Yes
C4 Elaboration: Listed in the GitHub repository
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: Yes
E1 Information About Use Of AI Assistants: No
E1 Elaboration: An AI assistant (GPT-4o) was used solely to improve the style of the scientific writing, which was subsequently further refined by a human
Author Submission Checklist: Yes
Submission Number: 671