Keywords: Agent Safety, Indirect Prompt Injection Attacks
TL;DR: TrojanTools is an adaptive indirect prompt-injection (IPI) attack that auto-generates transferable prompts and selects stealthy tools, raising attack success rate by 2.13× and degrading system utility by a factor of 1.78 while remaining robust to recent defenses.
Abstract: The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution in daily applications. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast-evolving nature of modern AI agents. We introduce TrojanTools, a novel adaptive IPI attack framework that selects stealthier attack tools and generates adaptive attack prompts to create a rigorous security evaluation environment. Our approach comprises two key components: (1) Adaptive Attack Strategy Construction, which develops transferable adversarial strategies for prompt optimization, and (2) Attack Enhancement, which identifies stealthy tools capable of circumventing task-relevance defenses. Comprehensive experimental evaluation shows that TrojanTools achieves a 2.13× improvement in attack success rate while degrading system utility by a factor of 1.78. Notably, the framework maintains its effectiveness even against state-of-the-art defense mechanisms. Our method advances the understanding of IPI attacks and provides a useful reference for future research.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3620