Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

ACL ARR 2024 December Submission843 Authors

15 Dec 2024 (modified: 05 Feb 2025)ACL ARR 2024 December SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore a more vicious attack that even nullify the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. For evaluation we construct a benchmark and our experiments demonstrate that backdoor-powered prompt injection attacks are much more harmful than previous prompt injection attacks, nullifying the instruction hierarchy strategies.

Paper Type: Long

Research Area: NLP Applications

Research Area Keywords: Backdoor Attacks, Prompt Injection Attacks, Large Language Models

Contribution Types: NLP engineering experiment, Data resources

Languages Studied: English

Submission Number: 843

Loading