Intention Analysis Makes LLMs A Good Jailbreak Defender

ACL ARR 2024 June Submission 4819 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). The principle behind $\mathbb{IA}$ is to trigger LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method and thus can enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on various jailbreak benchmarks across ChatGLM, Llama2, Llama3, Vicuna, MPT, DeepSeek, and GPT-3.5 show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2\% reduction in attack success rate on average) without compromising general helpfulness. Encouragingly, with the help of our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intentions. Further analyses offer insights into how the $\mathbb{IA}$ mechanism works and suggest two directions for improving its performance. The code will be released.
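As a reading aid, the sketch below illustrates the two-stage prompting flow described in the abstract. It is a minimal, hypothetical rendering: the `chat` helper, the model name, and the prompt wording are assumptions for illustration, not the authors' released prompts or implementation.

```python
# Minimal sketch of the two-stage Intention Analysis (IA) prompting flow
# described in the abstract. The `chat` function, model name, and prompt
# wording are illustrative assumptions, not the paper's exact implementation.

def chat(messages, model="vicuna-7b"):
    """Placeholder for any chat-completion backend that returns a string reply."""
    raise NotImplementedError("plug in your own LLM API here")

def intention_analysis_respond(user_input: str) -> str:
    # Stage 1: ask the model to analyze the essential intention of the input.
    stage1 = [
        {"role": "user",
         "content": ("Identify the essential intention behind the following "
                     "user query before answering it:\n\n" + user_input)},
    ]
    intention = chat(stage1)

    # Stage 2: ask for a final, policy-aligned response conditioned on the
    # first-round conversation (i.e., the analyzed intention).
    stage2 = stage1 + [
        {"role": "assistant", "content": intention},
        {"role": "user",
         "content": ("Given the intention you identified, now respond to the "
                     "original query in a way that is helpful and consistent "
                     "with safety policies.")},
    ]
    return chat(stage2)
```

Because both stages are plain prompting at inference time, no fine-tuning or model changes are required, which matches the abstract's claim that $\mathbb{IA}$ is an inference-only method.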
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Generation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 4819