SPIN: Self-Supervised Prompt INjection

ICLR 2025 Conference Submission577 Authors

13 Sept 2024 (modified: 28 Nov 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: Safety, Alignment, Defense, Adversarial Attack, Inference, LLM, NLP
TL;DR: An inference-time defense method that guards Large Language Models against adversarial attacks
Abstract: Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed to bypass safety alignment and cause the model to produce harmful responses. We introduce Defensive Self-supervised Prompt INjection (D-SPIN), which can detect and reverse these attacks on LLMs. Our method works solely by injecting an adaptive defense prompt at inference time, making it simple, effective, and compatible with existing safety-aligned models. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9%, while maintaining performance on benign user requests. We also consider an adaptive attacker and show that our method remains resilient against attackers who are aware of our defense.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 577