Defending against Indirect Prompt Injection by Instruction Detection

Defending against Indirect Prompt Injection by Instruction Detection

ACL ARR 2025 May Submission1931 Authors

18 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can effectively detecting such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60\% in the in-domain setting and 96.90\% in the out-of-domain setting, and reduces the attack success rate to just 0.03\% on the BIPIA benchmark.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: safety and alignment, security and privacy

Contribution Types: Model analysis & interpretability

Languages Studied: English

Keywords: safety, large language model, indirect prompt injection attack

Submission Number: 1931

Loading