Keywords: Web Agents; Web Failure Injections; Network Proxy; Robustness Evaluation; Black-box Approach; Security
TL;DR: WAREX is a tool that injects common web failures to evaluate the reliability of LLM web agents, without modifying agent code or benchmarks.
Abstract: Recent advances in browser-based large language model (LLM) agents have shown promise for automating tasks ranging from simple form filling to complex activities such as hotel booking and online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. Under real-world conditions, however, users encounter instability from multiple sources, including client-side and server-side failures, insecure HTTPS connections, and broader system failures. Moreover, real-world websites are vulnerable to threats such as cross-site scripting (XSS), malicious pop-ups, and general site modifications, all of which can compromise both safety and task completion. To address this gap, we introduce WAREX, a plug-and-play tool that integrates with existing web agent benchmarks to simulate common website failures. WAREX uses a proxy-based approach to evaluate the robustness of web agents without modifying agent code or benchmarks. We apply WAREX to three widely used benchmarks (WebArena, WebVoyager, and REAL) and show that introducing simulated web failures leads to substantial drops in task success rates for state-of-the-art agents. Our results reveal fundamental weaknesses in current web agent benchmarks and establish WAREX as a practical evaluation infrastructure for testing web agents under realistic failure conditions.
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9773
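Illustrative sketch: the proxy-based failure injection described in the abstract can be pictured as a function that rewrites an HTTP response before it reaches the agent's browser. The Python below is a minimal sketch under stated assumptions, not WAREX's actual implementation: the `Response` type, the failure names `server_error` and `popup`, the pop-up markup, and the `maybe_inject` helper are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for an HTTP response seen by a proxy."""
    status: int
    headers: dict
    body: str


# Hypothetical overlay markup; a real injected pop-up could look very different.
POPUP_HTML = (
    '<div id="injected-popup" style="position:fixed;top:0;left:0">'
    "Click here to continue</div>"
)


def server_error(resp: Response) -> Response:
    """Simulate a server-side failure by replacing the page with a 500 error."""
    return Response(
        status=500,
        headers={"Content-Type": "text/html"},
        body="<h1>500 Internal Server Error</h1>",
    )


def inject_popup(resp: Response) -> Response:
    """Simulate a pop-up by adding an overlay element to HTML pages only."""
    if "html" not in resp.headers.get("Content-Type", ""):
        return resp  # leave JSON, images, etc. untouched
    if "</body>" in resp.body:
        body = resp.body.replace("</body>", POPUP_HTML + "</body>", 1)
    else:
        body = resp.body + POPUP_HTML
    return replace(resp, body=body)


# Hypothetical failure names mapped to their rewrite functions.
FAILURES = {"server_error": server_error, "popup": inject_popup}


def maybe_inject(resp: Response, kind: str, probability: float,
                 rng: random.Random) -> Response:
    """Apply the chosen failure with the given probability, else pass through."""
    if rng.random() < probability:
        return FAILURES[kind](resp)
    return resp


if __name__ == "__main__":
    page = Response(200, {"Content-Type": "text/html"},
                    "<html><body><p>hi</p></body></html>")
    print(maybe_inject(page, "server_error", 1.0, random.Random(0)).status)
    print(maybe_inject(page, "popup", 1.0, random.Random(0)).body)
```

Because the rewrite happens on the response itself, neither the agent nor the benchmark needs to change, which is the black-box property the TL;DR emphasizes. In a real deployment the same logic would sit inside an intercepting proxy rather than being called directly.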