WARP: Measuring and Mitigating Evaluation Awareness in Browser-Agent Safety Benchmarks

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, Venue: AI4GOOD Workshop 2026 Spotlight, License: CC BY 4.0
Keywords: agents, security, evaluation awareness, prompt injections, browser-use agents
TL;DR: Browser-agent safety benchmarks overreport safety because they do not scale and agents recognize them as tests; we present a 50-task indirect-prompt-injection benchmark with increased realism and high attack success rates on SOTA models.
Abstract: Browser-agent safety benchmarks underwrite deployment safety claims, but only if model behavior under evaluation generalizes to deployment. We present the first cross-benchmark measurement of evaluation awareness in browser-agent indirect prompt-injection (IPI) benchmarks, across six frontier models and four published benchmarks (DoomArena, WASP, AgentDojo, InjecAgent; n=4,608 trajectories). When asked to classify the purpose of a trajectory, models label over half of the trajectories on every benchmark as evaluations, and three of the four benchmarks elicit unprompted verbalization of awareness. Awareness has a causal effect on the reported attack-success rate (ASR): adding an explicit evaluation cue lowers ASR by 10.9 pp on DoomArena, while a length-matched non-evaluation cue does not. A residual-stream probe on an open-weight model also separates attacked from safe trajectories at ≈2σ. Together, these results indicate that frontier models systematically distinguish IPI safety benchmarks from deployment, which undermines the validity of reported ASR. Building on this diagnosis, we release WARP (Wrapper-based, Adaptive, Realistic Pipeline), a benchmark-generation pipeline that builds challenging and realistic IPI evaluation tasks on top of any seedable synthetic browser environment. Instantiated as a 50-task benchmark on WebArena Verified, WARP attains 80.7% headline ASR across the same six models, substantially above the raw DoomArena (22.3%) and WASP (27.0%) rates, while lowering evaluation-awareness rates relative to a no-rewrite baseline. We advocate routine reporting of evaluation awareness. Code, benchmark, and the awareness-measurement toolkit are released.
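Illustrative note: the cue comparison described in the abstract comes down to a difference in ASR, in percentage points, between a cued condition and a baseline. The sketch below is a minimal illustration of that arithmetic only. It is not the released toolkit; the `Trajectory` type, the condition labels, and the toy counts are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical condition labels: "baseline", "eval_cue", "control_cue".
    condition: str
    attack_succeeded: bool


def attack_success_rate(trajectories, condition):
    """Fraction of trajectories in `condition` where the injected attack succeeded."""
    outcomes = [t.attack_succeeded for t in trajectories if t.condition == condition]
    if not outcomes:
        raise ValueError(f"no trajectories for condition {condition!r}")
    return sum(outcomes) / len(outcomes)


def asr_shift_pp(trajectories, treated, baseline="baseline"):
    """ASR(treated) minus ASR(baseline), expressed in percentage points."""
    return 100.0 * (
        attack_success_rate(trajectories, treated)
        - attack_success_rate(trajectories, baseline)
    )


# Toy data, invented for illustration (NOT the paper's measurements):
# 3/10 attacks succeed at baseline, 1/10 with an evaluation cue,
# and 3/10 with a length-matched control cue.
toy = (
    [Trajectory("baseline", True)] * 3 + [Trajectory("baseline", False)] * 7
    + [Trajectory("eval_cue", True)] * 1 + [Trajectory("eval_cue", False)] * 9
    + [Trajectory("control_cue", True)] * 3 + [Trajectory("control_cue", False)] * 7
)

eval_shift = asr_shift_pp(toy, "eval_cue")        # negative: the cue lowers ASR
control_shift = asr_shift_pp(toy, "control_cue")  # zero in this toy data
```

A negative shift for the evaluation cue combined with a near-zero shift for the length-matched control is the pattern the abstract describes; a real analysis would also report uncertainty (for example, confidence intervals over trajectories), which this sketch omits.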
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 465