Shield and Spear: Jailbreaking Aligned LLMs with Generative Prompting

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: Large Language Models (LLMs) have demonstrated remarkable text generation and logical reasoning capabilities. However, attackers may exploit these capabilities to generate harmful content despite the security measures implemented by developers. This unauthorized usage is metaphorically called "jailbreaking", as attackers aim to escape the secure restrictions (the "jail") set by developers. To strengthen the security defenses of LLMs, this paper introduces a novel automated jailbreaking approach. We first have LLMs generate relevant malicious settings based on the content of violation questions, and then integrate these settings with the questions to elicit jailbroken responses. We conducted experiments on various aligned LLMs, including Vicuna, Llama2, ChatGPT, and GPT-4. On a test set of 70 violation questions spanning 7 categories, our method achieved a success rate of 90% even against the most robust model, GPT-4. The experimental results validate the effectiveness of our method and further encourage consideration of the relationship between LLMs' capabilities and their security.
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English