AdvAgent: Controllable Blackbox Red-teaming on Web Agents

Published: 01 May 2025, Last Modified: 18 Jun 2025ICML 2025 posterEveryoneRevisionsBibTeXCC BY 4.0
TL;DR: We introduce AdvAgent, a black-box red-teaming framework targeting generalist web agents, leveraging the feedback from target agents to generate adaptive adversarial prompts.
Abstract: Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms. We release code at https://ai-secure.github.io/AdvAgent/.
Lay Summary: Web agents powered by large language models are being used to automate online tasks, but they can be vulnerable to manipulation. We introduce AdvAgent, a method that learns to probe these agents for weaknesses using only their output—no internal access required. Our system reliably uncovers security flaws in advanced agents and shows that current defenses provide limited protection. This work highlights the need for stronger safeguards as AI agents take on more complex, real-world responsibilities.
Primary Area: Deep Learning->Robustness
Keywords: Web Agents, Large Language Models, Prompt Injection Attack
Flagged For Ethics Review: true
Submission Number: 15419
Loading