SPIN: Self-Supervised Prompt INjection

Leon Zhou; Junfeng Yang; Chengzhi Mao

13 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Safety, Alignment, Defense, Adversarial Attack, Inference, LLM, NLP

TL;DR: An inference-time defense method that guards Large Language Models against adversarial attacks

Abstract: Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed that bypass safety alignment and cause a model to produce harmful responses. We introduce Defensive Self-supervised Prompt INjection (D-SPIN), which can detect and reverse these attacks on LLMs. Because it only injects an adaptive defense prompt at inference time, our method is simple, effective, and compatible with existing safety-aligned models. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9% while maintaining performance on benign user requests. In addition, we consider an adaptive attacker and show that our method remains resilient against attackers who are aware of our defense.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 577
