Keywords: Agents, Redteaming, Evaluations, Reasoning, Foundation Models
TL;DR: Robustness to dark patterns decreases as model size increases
Abstract: Deceptive UI designs, widely instantiated across the web and commonly known
as dark patterns, manipulate users into performing actions misaligned with their
goals. In this paper, we show that dark patterns are highly effective in steering
agent trajectories, posing a significant risk to agent robustness. To quantify
this risk, we introduce DECEPTICON, an environment for testing individual
dark patterns in isolation. DECEPTICON includes 700 web navigation tasks with
dark patterns (600 generated tasks and 100 real-world tasks), designed to measure
instruction-following success and dark pattern effectiveness. Across state-of-the-art
agents, we find dark patterns successfully steer agent trajectories towards malicious
outcomes in over 70% of tested generated and real-world tasks, compared
to a human average of 31%. Moreover, we find that dark pattern effectiveness
correlates positively with model size and test-time reasoning, making larger, more
capable models more susceptible. Leading countermeasures against adversarial
attacks, including in-context prompting and guardrail models, fail to consistently
reduce the success rate of dark pattern interventions. Our findings reveal dark patterns
as a latent and unmitigated risk to web agents, highlighting the urgent need
for robust defenses against manipulative designs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22655