Evaluating the Instruction-Following Robustness of Large Language Models to Prompt InjectionDownload PDF

Anonymous

16 Feb 2024ACL ARR 2024 February Blind SubmissionReaders: Everyone
Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine (1) the extent to which LLMs can be influenced by injected instructions and (2) their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our results indicate that some models are overly tuned to follow any embedded instructions in the prompt, overly focusing on the latter parts of the prompt without fully grasping the entire context. By contrast, models with a better grasp of the context and instruction-following capabilities will potentially be more susceptible to compromise by injected instructions. This underscores the need to shift the focus from merely enhancing LLMs' instruction-following capabilities to improving their overall comprehension of prompts and discernment of instructions that are appropriate to follow. We hope our in-depth analysis offers insights into the underlying causes of these vulnerabilities, aiding in the development of future solutions.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: In Section 6
A2: no
A2 Elaboration For Yes Or No: We can not think of potential risks of our work, as discussed in Section 7.
A3: yes
A3 Elaboration For Yes Or No: In Abstract and Section 1 Introduction
B: yes
B1: no
B1 Elaboration For Yes Or No: We create artifacts.
B2: no
B2 Elaboration For Yes Or No: We use OpenAI's service under the agreement of their terms of use, which can be found at https://openai.com/policies/terms-of-use.
B3: no
B3 Elaboration For Yes Or No: We ensure that the question-answering datasets and instruction datasets are utilized in accordance with their intended purposes, and the artifacts created are in compliance with the original access conditions
B4: yes
B4 Elaboration For Yes Or No: We re-used established existing open-source dataset and the newly generated instructions/questions do not include identity information or offensive content, as discussed in Section 7.
B5: no
B5 Elaboration For Yes Or No: We generated additional questions for existing question-answering datasets.
B6: yes
B6 Elaboration For Yes Or No: In Section 3 and the Appendix
C: yes
C1: yes
C1 Elaboration For Yes Or No: In Section 4.1 and Appendix A.1.
C2: yes
C2 Elaboration For Yes Or No: In Section 4.1 and Appendix A.1
C3: yes
C3 Elaboration For Yes Or No: In Appendix A.1
C4: no
D: yes
D1: no
D1 Elaboration For Yes Or No: We report the instructions for annotators in Section 4.5
D2: yes
D2 Elaboration For Yes Or No: In Section 4.5
D3: yes
D3 Elaboration For Yes Or No: Annotators are informed that the data will be utilized for research purposes.
D4: no
D4 Elaboration For Yes Or No: We can not think of ethics issues.
D5: yes
D5 Elaboration For Yes Or No: In Section 4.5
E: yes
E1: no
E1 Elaboration For Yes Or No: We only use AI assistants for correcting grammar errors in the paper
0 Replies

Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview