Keywords: LLM, Adversarial Attacks, Black Box, Jailbreak
TL;DR: In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible.
Abstract: We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that—when combined with a user’s query—disrupts the attacked model’s alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
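To make the optimization loop implied by the abstract concrete, the sketch below shows a generic genetic algorithm over a fixed-length adversarial suffix. It is a minimal illustration, not the paper's method: the `fitness` function, character set, and hyperparameters are placeholder assumptions; in the actual black-box setting, fitness would be scored from the target LLM's responses to "user query + suffix".

```python
import random
import string

# Placeholder black-box score (assumption): a real attack would query the
# target model with the user query plus this suffix and measure how far the
# response is from a refusal. This stand-in just makes the sketch runnable.
def fitness(suffix: str) -> float:
    return (sum(ord(c) for c in suffix) % 100) / 100.0

CHARSET = string.ascii_letters + string.digits + " !?"
SUFFIX_LEN = 20      # length of the universal adversarial suffix
POP_SIZE = 30        # candidate suffixes per generation
GENERATIONS = 50
MUTATION_RATE = 0.1  # per-character mutation probability

def random_suffix() -> str:
    return "".join(random.choice(CHARSET) for _ in range(SUFFIX_LEN))

def mutate(suffix: str) -> str:
    # Randomly replace characters with fresh ones at the mutation rate.
    return "".join(
        random.choice(CHARSET) if random.random() < MUTATION_RATE else c
        for c in suffix
    )

def crossover(a: str, b: str) -> str:
    # Single-point crossover between two parent suffixes.
    cut = random.randrange(1, SUFFIX_LEN)
    return a[:cut] + b[cut:]

def evolve() -> str:
    population = [random_suffix() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Elitist selection: keep the top half, breed the rest from it.
        parents = sorted(population, key=fitness, reverse=True)[: POP_SIZE // 2]
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(POP_SIZE - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    print("Best adversarial suffix candidate:", evolve())
```

Because the loop only needs a scalar score per candidate, it never touches model weights or gradients, which is what makes a GA a natural fit for the black-box setting described above.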
Submission Number: 1