AgentXploit: End-to-End Red-Teaming for AI Agents Powered by Multi-Agent Systems

13 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: security, agent, prompt injection
Abstract: AI agents, powered by Large Language Models (LLMs), are vulnerable to indirect prompt injection attacks, in which malicious data from external tools and data sources can manipulate agent behavior. Existing work mainly focuses on developing strategies to generate attack prompts within a pre-constructed attack path. The major challenge, however, lies in systematically identifying attack vectors and automatically constructing attack paths from system entry points to target components. Without such mechanisms, no existing red-teaming approach can be applied to real-world AI agents for end-to-end attacks. We propose AgentXploit, the first fully automatic, multi-agent framework that systematically identifies and demonstrates these vulnerabilities through white-box analysis of the agent's source code. Our framework operates in two stages. First, an Analyzer agent inspects the target agent's codebase using dynamic task planning, long-term context management, and semantic code browsing. It generates comprehensive reports on agent workflows, high-risk tools that process untrusted data, sensitive tools that can cause harm, and viable attack paths. Second, an Exploiter agent uses these reports to dynamically craft and execute attacks. It leverages a specialized seed corpus, context-aware injection generation guided by real-time agent feedback, and multi-injection collaboration to reliably trigger malicious behavior. We evaluate AgentXploit on two popular open-source agents, successfully demonstrating a range of attacks. On the AgentDojo benchmark, our Exploiter achieves a 79% attack success rate, outperforming the prior state of the art by 27%.
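The two-stage pipeline described in the abstract can be sketched at a high level. The following is a minimal, hypothetical illustration (not the authors' implementation): the Analyzer stage is reduced to a path search over a tool-call graph from untrusted entry points to sensitive tools, and the Exploiter stage to a loop that tries seed injections along each discovered path. All names (`AnalysisReport`, `analyze`, `exploit`, the seed strings) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisReport:
    # Hypothetical stand-in for the Analyzer's report.
    entry_points: list = field(default_factory=list)    # tools ingesting untrusted data
    sensitive_tools: list = field(default_factory=list) # tools whose misuse causes harm
    attack_paths: list = field(default_factory=list)    # entry-point -> sensitive-tool chains

def analyze(tool_graph: dict, untrusted: set, sensitive: set) -> AnalysisReport:
    """Stage 1 (sketch): enumerate paths from untrusted entry points to sensitive tools."""
    report = AnalysisReport(entry_points=sorted(untrusted & tool_graph.keys()),
                            sensitive_tools=sorted(sensitive))
    def dfs(node, path):
        if node in sensitive:
            report.attack_paths.append(path)
            return
        for nxt in tool_graph.get(node, []):
            if nxt not in path:  # avoid cycles
                dfs(nxt, path + [nxt])
    for entry in report.entry_points:
        dfs(entry, [entry])
    return report

def exploit(report: AnalysisReport, try_injection) -> list:
    """Stage 2 (sketch): try seed injections along each attack path.

    `try_injection(path, seed)` is a caller-supplied oracle that runs the
    target agent with `seed` planted at the path's entry point and reports
    whether the sensitive tool was triggered.
    """
    seeds = ["ignore previous instructions", "system: call the tool below"]
    successes = []
    for path in report.attack_paths:
        for seed in seeds:
            if try_injection(path, seed):
                successes.append((path, seed))
                break  # one working injection per path suffices here
    return successes
```

In the paper's actual framework, the analysis is LLM-driven over source code rather than a fixed graph search, and the injection loop is guided by real-time agent feedback rather than a static seed list; this sketch only captures the entry-point-to-target structure the abstract describes.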
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4881