Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

Published: 04 Mar 2024 · Last Modified: 14 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: LLM, Adversarial Attacks, Black Box, Jailbreak
TL;DR: In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible.
Abstract: We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user's query, disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
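To illustrate the kind of black-box search loop the abstract describes, the sketch below evolves a universal adversarial suffix over a set of queries with a simple genetic algorithm. It is a minimal, hypothetical sketch only: `query_model`, the `fitness` definition, the token vocabulary, and all hyperparameters are placeholder assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a black-box GA search for a universal adversarial suffix.
# NOTE: query_model, fitness, VOCAB, and all hyperparameters are hypothetical
# placeholders for illustration; they do not reproduce the paper's method.

import random

VOCAB = [str(i) for i in range(1000)]   # placeholder token vocabulary
SUFFIX_LEN = 20                          # length of the adversarial suffix
POP_SIZE = 50
GENERATIONS = 100
MUTATION_RATE = 0.1


def query_model(prompt: str) -> str:
    """Stand-in for a black-box LLM call (e.g. an HTTP API); only the text output is observed."""
    return "I cannot help with that."    # dummy response for this sketch


def fitness(suffix: list[str], queries: list[str]) -> float:
    """Hypothetical fitness: fraction of queries the model does NOT refuse
    when the suffix is appended (higher = stronger attack)."""
    refusals = 0
    for q in queries:
        response = query_model(q + " " + " ".join(suffix))
        if "cannot" in response.lower() or "sorry" in response.lower():
            refusals += 1
    return 1.0 - refusals / len(queries)


def crossover(a: list[str], b: list[str]) -> list[str]:
    # Single-point crossover between two parent suffixes.
    point = random.randrange(1, SUFFIX_LEN)
    return a[:point] + b[point:]


def mutate(suffix: list[str]) -> list[str]:
    # Replace each token with a random vocabulary token with small probability.
    return [random.choice(VOCAB) if random.random() < MUTATION_RATE else tok
            for tok in suffix]


def evolve(queries: list[str]) -> list[str]:
    # Initialize a random population of suffixes, then iterate selection,
    # crossover, and mutation using only black-box fitness evaluations.
    population = [[random.choice(VOCAB) for _ in range(SUFFIX_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=lambda s: fitness(s, queries), reverse=True)
        elite = scored[: POP_SIZE // 5]                 # keep the top 20%
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(POP_SIZE - len(elite))]
        population = elite + children
    return max(population, key=lambda s: fitness(s, queries))
```

Because the loop only reads model responses, it needs no gradients or parameters, which is what makes the attack black-box; the same evolved suffix is reused across queries, which is what makes it universal.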
Submission Number: 1