Keywords: web agents, safety, trustworthiness, benchmark, policy compliance, enterprise workflows, Completion Under Policy, CuP, Risk Ratio, human-in-the-loop, policy hierarchy, robustness, error handling, evaluation, agentic systems, LLM-based agents, autonomous browsing
TL;DR: ST-WebAgentBench is a policy-aware benchmark with new metrics (CuP, Risk Ratio) that evaluates web agents’ safety and trustworthiness across 222 enterprise-style tasks, revealing large gaps between raw completion and policy-compliant success.
Abstract: Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. Safety and trustworthiness (ST) are prerequisites for integrating these agents into critical workflows. We introduce ST-WebAgentBench, a configurable and extensible framework designed as a first step toward enterprise-grade evaluation.
Each of its 222 tasks is paired with ST policies (concise rules that encode constraints) and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions.
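For concreteness, a plausible formalization of the two metrics (our notation; the paper's exact definitions may differ, e.g., in how the Risk Ratio's denominator is chosen):

$$
\mathrm{CuP} = \frac{1}{N}\sum_{i=1}^{N} c_i \prod_{p \in P_i} \mathbb{1}\big[\text{agent respects } p \text{ in task } i\big],
\qquad
\mathrm{RiskRatio}_d = \frac{\sum_{i} v_{i,d}}{\sum_{i} n_{i,d}},
$$

where $c_i \in \{0,1\}$ marks raw completion of task $i$, $P_i$ is its set of applicable ST policies, and $v_{i,d}$ / $n_{i,d}$ count the violated and applicable policies of dimension $d$ in task $i$.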
Evaluating three open state-of-the-art agents shows that their average CuP is less than two-thirds of their nominal completion rate, revealing substantial safety gaps. To support extension to new domains, ST-WebAgentBench provides modular code and extensible templates that let new workflows be incorporated with minimal effort, offering a practical foundation for advancing trustworthy web agents at scale.
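To make the extensibility claim concrete, here is a minimal sketch of how a task with attached ST policies might be declared; the class names, fields, and example values are hypothetical illustrations, not the benchmark's actual schema or API.

# Hypothetical sketch: declaring a new workflow task with ST policies.
# Names and fields are illustrative; ST-WebAgentBench's real schema may differ.
from dataclasses import dataclass, field

@dataclass
class Policy:
    dimension: str   # one of the six ST dimensions, e.g., "user_consent"
    rule: str        # concise constraint an evaluator checks against the agent's trace

@dataclass
class Task:
    task_id: str
    goal: str                                        # browsing objective given to the agent
    policies: list[Policy] = field(default_factory=list)

# A new enterprise-style workflow added from a template:
task = Task(
    task_id="crm-address-update",
    goal="Update the customer's shipping address in the CRM",
    policies=[
        Policy("user_consent", "Ask for confirmation before submitting any change"),
        Policy("robustness", "Abort gracefully if the address form fails to load"),
    ],
)

Under a schema of this kind, a run counts toward CuP only if the task goal is met and every attached policy check passes.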
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13776