Jailbreaking Language Models at Scale via Persona Modulation

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Language Model, Prompt Engineering, Model Evaluations, Red-Teaming, Persona Modulation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Despite significant efforts to align large language models to produce harmless responses, their safety mechanisms remain vulnerable to jailbreaks: prompts that elicit undesirable behaviour. In this work, we investigate persona modulation as a black-box jailbreak that steers the target model to take on particular personalities (personas) that are more likely to comply with harmful instructions. We show that persona modulation can be automated to exploit this vulnerability at scale: rather than manually crafting a jailbreak prompt for each persona, we use a novel jailbreak prompt that instructs a language model to generate persona-modulation prompts for arbitrary topics. Persona modulation achieves high attack success rates against GPT-4, and the resulting prompts transfer to other state-of-the-art models such as Claude 2 and Vicuna. Our work expands the attack surface for misuse and highlights new vulnerabilities in large language models.
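At a high level, the automated pipeline described in the abstract can be pictured as a thin red-teaming harness around two chat models: an assistant model that writes a persona-modulation system prompt for a given misuse category, and a target model that is then queried under that persona. The sketch below is an illustration of that structure only, not the paper's implementation; the meta-prompt, model names, categories, and helper functions are placeholders, and the paper's actual jailbreak prompt is deliberately not reproduced.

```python
# Minimal sketch of an automated persona-modulation harness (illustrative only).
# Assumptions: the OpenAI Python client (>=1.0) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Placeholder standing in for the paper's jailbreak meta-prompt, which asks an
# assistant model to write a persona-modulation system prompt for a given category.
META_PROMPT = "<meta-prompt from the paper; not reproduced here>"


def generate_persona_prompt(category: str, assistant_model: str = "gpt-4") -> str:
    """Ask the assistant model for a persona-modulation system prompt."""
    resp = client.chat.completions.create(
        model=assistant_model,
        messages=[
            {"role": "system", "content": META_PROMPT},
            {"role": "user", "content": f"Misuse category: {category}"},
        ],
    )
    return resp.choices[0].message.content


def query_target(persona_prompt: str, instruction: str, target_model: str = "gpt-4") -> str:
    """Query the target model while it is steered by the generated persona prompt."""
    resp = client.chat.completions.create(
        model=target_model,
        messages=[
            {"role": "system", "content": persona_prompt},
            {"role": "user", "content": instruction},
        ],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    # Evaluation loop over placeholder categories and instructions.
    for category in ["<category-1>", "<category-2>"]:
        persona = generate_persona_prompt(category)
        completion = query_target(persona, "<evaluation instruction for this category>")
        print(category, completion[:200])
```

The key design point conveyed by the abstract is that the assistant-model step replaces manual prompt engineering, which is what makes the attack (and its evaluation) scale across arbitrary topics and transfer to other target models.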
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2149