Despite significant efforts to align large language models to produce harmless responses, their safety mechanisms remain vulnerable to prompts that elicit undesirable behaviour: jailbreaks. In this work, we investigate persona modulation as a black-box jailbreak that steers the target model to adopt particular personalities (personas) that are more likely to comply with harmful instructions. We show that persona modulation can be automated to exploit this vulnerability at scale: rather than manually crafting a jailbreak prompt for each persona, we use a novel jailbreak prompt that instructs a language model to generate persona-modulation prompts for arbitrary topics. Persona modulation achieves high attack success rates against GPT-4, and the resulting prompts transfer to other state-of-the-art models such as Claude 2 and Vicuna. Our work expands the attack surface for misuse and highlights new vulnerabilities in large language models.