ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy; Ben Wiesel; Sami Marreed; Alon Oved; Avi Yaeli; Segev Shlomov

Published: 26 Jan 2026, Last Modified: 11 Feb 2026. Venue: ICLR 2026 Poster. License: CC BY 4.0

Keywords: web agents, safety, trustworthiness, benchmark, policy compliance, enterprise workflows, Completion Under Policy, CuP, Risk Ratio, human-in-the-loop, policy hierarchy, robustness, error handling, evaluation, agentic systems, LLM-based agents, autonomous browsing

TL;DR: ST-WebAgentBench is a policy-aware benchmark with new metrics (CuP, Risk Ratio) that evaluates web agents’ safety and trustworthiness across 222 enterprise-style tasks, revealing large gaps between raw completion and policy-compliant success.

Abstract: Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and extensible framework designed as a first step toward enterprise-grade evaluation. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents shows their average CuP is less than two-thirds of their nominal completion rate, revealing substantial safety gaps. To support growth and adaptation to new domains, ST-WebAgentBench provides modular code and extensible templates that enable new workflows to be incorporated with minimal effort, offering a practical foundation for advancing trustworthy web agents at scale.
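To make the two metrics concrete, here is a minimal sketch of how they might be computed from per-task results. It is not the benchmark's own code: the `TaskResult` record layout and the function names are hypothetical, and the abstract does not give the exact Risk Ratio formula, so the sketch assumes it is the number of violated policies divided by the number of applied policies within each dimension.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Outcome of one agent run (hypothetical record layout)."""
    completed: bool
    # dimension name -> (policies applied, policies violated)
    policies: dict = field(default_factory=dict)


def completion_rate(results):
    """Raw task success: fraction of tasks the agent finished."""
    return sum(r.completed for r in results) / len(results)


def completion_under_policy(results):
    """CuP: a task counts only if it was completed with zero violations."""
    compliant = sum(
        r.completed and all(v == 0 for _, v in r.policies.values())
        for r in results
    )
    return compliant / len(results)


def risk_ratio(results):
    """Assumed definition: violations / applied policies, per dimension."""
    applied = defaultdict(int)
    violated = defaultdict(int)
    for r in results:
        for dim, (n_applied, n_violated) in r.policies.items():
            applied[dim] += n_applied
            violated[dim] += n_violated
    return {d: violated[d] / applied[d] for d in applied if applied[d]}


# Toy data, invented for illustration only.
results = [
    TaskResult(True, {"user_consent": (2, 0), "robustness": (1, 0)}),
    TaskResult(True, {"user_consent": (2, 1)}),
    TaskResult(False, {"robustness": (1, 0)}),
    TaskResult(True, {"robustness": (2, 1)}),
]

print(completion_rate(results))          # 0.75
print(completion_under_policy(results))  # 0.25
print(risk_ratio(results))
```

On this toy data three of four tasks complete, but only one completes without violating a policy, so CuP falls well below the raw completion rate. That is the kind of gap the abstract reports for the evaluated agents.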

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 13776
