WARP: A Wrapper-Based, Adaptive, Realistic Pipeline for Reliable Web-Agent Robustness Testing

Published: 23 May 2026, Last Modified: 23 May 2026 | ICML 2026 AIWILD | Everyone | CC BY 4.0
Keywords: agents, security, evaluation awareness, prompt injections, browser-use agents
TL;DR: Browser-agent safety benchmarks overreport safety because they don't scale and agents recognize them as tests; we present a 50-task indirect-prompt-injection benchmark with increased realism and high attack success rates on SOTA models.
Abstract: AI agents deployed in browser settings face escalating security threats, yet existing browser-agent security benchmarks may systematically under-report deployment vulnerability. We identify three causes: static attack sets, synthetic environments that are difficult to extend, and evaluation awareness, where frontier agents recognize benchmark environments as tests and change their behavior. We present the first cross-benchmark measurement of evaluation awareness in browser-agent safety benchmarks, covering six frontier models and four published benchmarks (DoomArena, WASP, AgentDojo, InjecAgent). Every benchmark induces unprompted verbalized awareness, and over half of the trajectories on each benchmark are read as evaluations under elicited probing. Through prompt-level and representation-level interventions, we further show that awareness causally biases reported attack success rates (ASR): adding cues that indicate an evaluation setting lowers the mean ASR by 10.8 percentage points (pp) on DoomArena and 5.3 pp on WASP. Building on this diagnosis, we release WARP, a Wrapper-based, Adaptive, Realistic Pipeline that builds challenging and realistic indirect-prompt-injection (IPI) evaluation tasks on top of any synthetic browser environment. WARP offers plug-and-play integration as a wrapper, an adaptive attacker loop, and iterative refinement to reduce evaluation awareness. We instantiate WARP as a 50-task benchmark on WebArena Verified that achieves high ASR on SOTA models and reports awareness alongside ASR. WARP moves the browser-agent robustness field toward more reliable evaluation results. We release our code, benchmark, and open-source tooling integrated with AgentLab for measuring evaluation realism.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 227