Benchmarking Prompt-Injection Attacks on Tool-Integrated LLM Agents with Externally Stored Personal Data
Keywords: LLM, Prompt Injection, Privacy, AI Agent
TL;DR: We extend InjecAgent’s model to cover externally stored personal data, measure actual leakage in multi-step tasks, and find that while attacks succeed at notable rates, existing defenses substantially reduce leakage.
Abstract: Tool-integrated agents often access users’ externally stored personal data to complete tasks, creating new vectors for privacy leakage. We study indirect prompt-injection attacks that exfiltrate such data at inference time and propose a data-flow–aware threat model requiring actual leakage, rather than mere task hijacking, to count as success. We (i) extend InjecAgent's threat model to include externally stored personal data and actual leakage measurement; (ii) integrate this threat model into AgentDojo's Banking suite, extending its user tasks from 16 to 48 across nine service categories and adding 11 new tools; (iii) evaluate six LLMs and four defense strategies; and (iv) analyze factors affecting leakage. On the original 16-task suite, most models reach $\approx$20\% targeted attack success rates (ASR), with Llama-4 17B peaking at 40\%; on the expanded 48-task suite, ASR averages 11–15\%. For GPT-4o, task utility drops by 12–22\% under attack. Exfiltration of highly sensitive fields alone is less common, but risk rises sharply when they are combined with one or two less-sensitive fields, especially when injections are semantically aligned with the original task. Some defenses eliminate leakage on the 16-task suite and can reduce ASR to $\approx$1\% on the expanded suite, often with utility trade-offs. These findings underscore the importance of data-flow–aware evaluation for developing agents resilient to inference-time privacy leakage.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19829