The Surface You Test Is Not the Surface That Breaks

ACL ARR 2026 May Submission17266 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: prompt injection, LLM agents, tool-augmented LLMs, adversarial robustness, adaptive attacks, AI safety, model evaluation, AgentDojo, attack surfaces, security

Abstract: Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate (ASR) per model on one channel, the tool output, and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. We hold the injection payload byte-identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT-4.1 shows a 96% ASR on tool outputs but only 4% on tool descriptions, while Gemini-3-Flash shows the mirror pattern at 20% and 98%. A variance decomposition over 6,830 attempts attributes 0% of the variation in attack outcomes to the surface alone, while the model×surface interaction accounts for 16.7%. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per-cell maximum over surfaces, exceeds the strongest fixed-surface baseline by 9.1 percentage points on average. Standard prompt-level defenses inherit the same blind spot, reducing tool-output ASR to 10-18% while leaving the description channel above 54%. Both attack and defense evaluation must report per-surface vulnerability.
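
The Adaptive Attack Rate described in the abstract (the per-cell maximum over surfaces, averaged over cells) can be illustrated with a minimal sketch. The code below is not the authors' implementation. It assumes that a "cell" can be approximated by a single model row, and it reuses only the two model-level figures quoted in the abstract. The names `asr`, `adaptive_attack_rate`, and `best_fixed_surface_rate` are hypothetical.

```python
from statistics import mean

# ASR per cell and surface. The two rows reuse the model-level figures quoted
# in the abstract; treating each model as one cell is an assumption made for
# illustration only.
asr = {
    "GPT-4.1": {"tool_output": 0.96, "tool_description": 0.04},
    "Gemini-3-Flash": {"tool_output": 0.20, "tool_description": 0.98},
}


def adaptive_attack_rate(asr_by_cell):
    """Mean over cells of the highest ASR across surfaces for that cell."""
    return mean(max(surfaces.values()) for surfaces in asr_by_cell.values())


def best_fixed_surface_rate(asr_by_cell):
    """Highest mean ASR when the same surface is used for every cell."""
    surfaces = next(iter(asr_by_cell.values())).keys()
    return max(
        mean(cell[surface] for cell in asr_by_cell.values())
        for surface in surfaces
    )


aar = adaptive_attack_rate(asr)
fixed = best_fixed_surface_rate(asr)
gap_pp = 100 * (aar - fixed)  # gap in percentage points

print(f"adaptive: {aar:.2f}, best fixed surface: {fixed:.2f}, gap: {gap_pp:.1f} pp")
```

With only these two rows the gap is far larger than the 9.1 percentage points reported over the full grid, because the two models were chosen as the most extreme mirror-image pair. The sketch shows how the metric is computed, not the paper's result.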

Paper Type: Long

Research Area: LLM agents

Research Area Keywords: Ethics, Bias, and Fairness; Resources and Evaluation; Interpretability and Analysis of Models for NLP

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis

Languages Studied: Eng

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 17266
