Modeling Normalcy: A Zero-Negative-Shot Defense Against Prompt Injection

19 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | CC BY 4.0
Keywords: LLM, Security, Prompt Injection, Instruction Following, Anomaly Detection, Benchmark, Dataset
Abstract: System prompts are the primary mechanism for customizing large language models (LLMs) and constraining them to a specific task, yet they are vulnerable to prompt injection. In practice, developers typically have many benign messages for a fixed system prompt, but few or no labeled attacks. Moreover, what counts as a violation depends on the system prompt, so defenses must be prompt-specific as well. To address this, we introduce TRAP-Bench, a realistic, diverse, and large-scale benchmark that maps granular constraints to representative system prompts, paired with benign interactions and constraint-specific attacks. Building on TRAP-Bench, we present DIAMOND (Detecting Injections via Activation MONitoring & anomaly Detection), a zero-negative-shot detector that frames prompt injection as anomaly detection in activation space: it models hidden-state regularities during normal use and flags deviations at inference time. Evaluated on TRAP-Bench, DIAMOND substantially outperforms a strong baseline on 88–98% of tasks across multiple models and achieves a task-averaged macro F1 of 0.75–0.83. Performance remains robust at low false-positive rates, supporting its practical viability. Together, our contributions provide both a realistic testbed for evaluating defenses and a practical, drop-in monitor for deployment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15447
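
Illustrative sketch (not from the submission): the abstract describes modeling hidden-state regularities on benign traffic only and flagging deviations at inference time. One generic way to do that, assumed here purely for illustration, is to fit a Gaussian to benign activation vectors and score new inputs by squared Mahalanobis distance, with the threshold chosen from benign scores to hit a target false-positive rate. The function names, the Gaussian model, and the synthetic vectors standing in for LLM hidden states are all hypothetical; DIAMOND's actual detector may differ.

```python
import numpy as np

def fit_benign_model(benign_acts, ridge=1e-3):
    """Fit mean and inverse covariance to benign activation vectors (rows)."""
    mean = benign_acts.mean(axis=0)
    cov = np.cov(benign_acts, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])  # ridge term keeps the matrix invertible
    return mean, np.linalg.inv(cov)

def anomaly_scores(acts, mean, inv_cov):
    """Squared Mahalanobis distance of each row from the benign mean."""
    centered = acts - mean
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

def threshold_at_fpr(benign_scores, target_fpr=0.01):
    """Pick the score above which only `target_fpr` of benign inputs fall."""
    return np.quantile(benign_scores, 1.0 - target_fpr)

# Synthetic stand-ins for hidden states (assumption: real use would extract
# activations from the deployed model for a fixed system prompt).
rng = np.random.default_rng(0)
dim = 16
benign_train = rng.normal(size=(2000, dim))
benign_held_out = rng.normal(size=(500, dim))
shifted = rng.normal(loc=3.0, size=(100, dim))  # stand-in for anomalous inputs

# No attack examples are used for fitting or thresholding (zero negative shots).
mean, inv_cov = fit_benign_model(benign_train)
threshold = threshold_at_fpr(anomaly_scores(benign_train, mean, inv_cov), 0.01)

flags_benign = anomaly_scores(benign_held_out, mean, inv_cov) > threshold
flags_shifted = anomaly_scores(shifted, mean, inv_cov) > threshold
print("held-out benign flag rate:", flags_benign.mean())
print("shifted flag rate:", flags_shifted.mean())
```

Setting the threshold from benign scores alone is what lets such a monitor be tuned to a false-positive budget without any labeled attacks.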