Shield and Spear: Jailbreaking Aligned LLMs with Generative Prompting

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: Large Language Models (LLMs) have demonstrated remarkable text generation and logical reasoning capabilities. However, attackers may exploit these capabilities to generate harmful content despite the security measures implemented by developers. This unauthorized usage is metaphorically called "jailbreaking", as attackers aim to escape the secure restrictions (the "jail") set by developers. To strengthen the security defenses of LLMs, this paper introduces a novel automated jailbreaking approach. We first have LLMs generate relevant malicious settings based on the content of violation questions, and then integrate these settings with the questions to elicit jailbroken responses. We conducted experiments on various aligned LLMs, including Vicuna, Llama2, ChatGPT, and GPT-4. On a test set of 70 violation questions spanning 7 categories, our method achieved a success rate of 90% even against the most robust model, GPT-4. The experimental results validate the effectiveness of our method and further encourage consideration of the relationship between LLMs' capabilities and their security.
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English