From the Wild Web to the Zoo: Benchmarking Web Agents with a Realistic Simulator

Published: 02 Mar 2026, Last Modified: 30 Mar 2026 | Agentic AI in the Wild: From Hallucinations to Reliable Autonomy | Poster | CC BY 4.0
Keywords: security, benchmark, multi-agent, web-agent
Abstract: Web agents are language-model-based AI systems capable of executing multi-step online workflows. They interact with the web in unexpected ways and present new classes of security challenges, so ensuring that these interactions are safe requires rigorous testing. However, the web's heterogeneity and lack of reproducibility, together with the complexity of language-model-driven agents, complicate security evaluation. Evaluating web agents robustly requires an environment that preserves the complexity they face in the wild while providing a level of reproducibility unavailable online. We introduce Zoo, a simulated web environment that enables realistic workflows across multiple interconnected web applications (including email, identity management, e-commerce platforms, collaborative tools, and web analytics) within a single network. Unlike the open web, Zoo also provides full access to backend state and supports deterministic re-initialization, both of which are essential for verifying AI systems. Building on the Zoo infrastructure, we implement a variety of proof-of-concept web agent evaluations designed to showcase Zoo's interconnected services and to leverage the transparent backend for capability and security verification.
Submission Number: 46