LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Published: 19 Mar 2024, Last Modified: 18 Jun 2024 · Tiny Papers @ ICLR 2024 · CC BY 4.0
Keywords: Large Language Models, adversarial attacks, LLM defense
TL;DR: LLM Self Defense is a highly effective zero-shot approach for shielding users from virtually all harmful LLM-generated content, without any modifications to the underlying model or input preprocessing, thus simplifying the defense process.
Abstract: Large language models (LLMs) are popular for high-quality text generation but can also produce harmful responses, as adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses, requiring no fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. Notably, LLM Self Defense reduces the attack success rate to virtually 0 against various types of attacks on GPT-3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense.
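A minimal sketch of the screening step described in the abstract, not the authors' released code: the harm-filter prompt wording and the `llm` callable interface are illustrative assumptions.

```python
from typing import Callable

# Sketch: wrap any text-generation callable (an API client or a local model)
# as the "harm filter" LLM described in the abstract. The prompt wording
# below is an assumption, not the paper's verbatim prompt.
HARM_FILTER_PROMPT = (
    "Does the following text contain harmful content? "
    "Answer 'Yes, this is harmful' or 'No, this is not harmful'.\n\n"
    "Text: {response}"
)

def is_harmful(candidate_response: str, llm: Callable[[str], str]) -> bool:
    """Ask a second LLM instance to classify a generated response as harmful."""
    verdict = llm(HARM_FILTER_PROMPT.format(response=candidate_response))
    return verdict.strip().lower().startswith("yes")

def self_defense(user_prompt: str, llm: Callable[[str], str]) -> str:
    """Generate a response, then screen it before returning it to the user."""
    response = llm(user_prompt)
    if is_harmful(response, llm):
        return "I'm sorry, I can't help with that."
    return response
```

Because the filter only reads the already-generated text, it can be bolted onto any deployed model without retraining or touching the user's prompt, which is the zero-shot, no-preprocessing property the TL;DR highlights.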
Submission Number: 115