PFT: Enhancing Prompt Injection Robustness via Position-Enhanced Finetuning

26 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Large Language Models (LLMs), Safety, Instruction Following, Adversarial Attacks, Position Encoding
Abstract: Large Language Models (LLMs) are widely adopted in closed-domain applications, where differentiating between system instructions and user input is crucial to prevent unintended malicious actions. However, instruction-following LLMs often blindly follow instructions in user inputs, opening up the risk of prompt injection attacks. This paper investigates whether Supervised Fine-Tuning (SFT) can teach LLMs to strictly distinguish system instructions from user input. Our study reveals a key weakness: SFT-tuned models follow system instructions reliably only when the key instruction is placed immediately after the initial tokens. We find that the proximity of the key instruction to the initial tokens significantly influences the model's ability to execute the intended task, and consequently, its susceptibility to prompt injection attacks.To address this issue, we propose PFT, a novel position-enhanced fine-tuning approach that leverages position IDs to more effectively distinguish between system and user tokens. The experimental results demonstrate that PFT improves the robustness of SFT-tuned models against prompt injection attacks, even when the key instruction is placed arbitrarily in the system prompt, without compromising performance. Our work sheds light on the importance of prompt format in enhancing the security of LLMs and offers a practical solution to improve their robustness.
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7526
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview