Keywords: LLM, Adversarial Attacks, Black Box, Jailbreak
TL;DR: In this paper we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible.
Abstract: We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that—when combined with a user’s query—disrupts the attacked model’s alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
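To make the optimization loop implied by the abstract concrete, the sketch below shows a generic genetic algorithm over a fixed-length adversarial suffix. It is a minimal illustration, not the paper's method: the `fitness` function, character set, and hyperparameters are placeholder assumptions; in the actual black-box setting, fitness would be scored from the target LLM's responses to "user query + suffix".

```python
import random
import string

# Placeholder black-box score (assumption): a real attack would query the
# target model with the user query plus this suffix and measure how far the
# response is from a refusal. This stand-in just makes the sketch runnable.
def fitness(suffix: str) -> float:
    return (sum(ord(c) for c in suffix) % 100) / 100.0

CHARSET = string.ascii_letters + string.digits + " !?"
SUFFIX_LEN = 20      # length of the universal adversarial suffix
POP_SIZE = 30        # candidate suffixes per generation
GENERATIONS = 50
MUTATION_RATE = 0.1  # per-character mutation probability

def random_suffix() -> str:
    return "".join(random.choice(CHARSET) for _ in range(SUFFIX_LEN))

def mutate(suffix: str) -> str:
    # Randomly replace characters with fresh ones at the mutation rate.
    return "".join(
        random.choice(CHARSET) if random.random() < MUTATION_RATE else c
        for c in suffix
    )

def crossover(a: str, b: str) -> str:
    # Single-point crossover between two parent suffixes.
    cut = random.randrange(1, SUFFIX_LEN)
    return a[:cut] + b[cut:]

def evolve() -> str:
    population = [random_suffix() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Elitist selection: keep the top half, breed the rest from it.
        parents = sorted(population, key=fitness, reverse=True)[: POP_SIZE // 2]
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(POP_SIZE - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    print("Best adversarial suffix candidate:", evolve())
```

Because the loop only needs a scalar score per candidate, it never touches model weights or gradients, which is what makes a GA a natural fit for the black-box setting described above.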
Submission Number: 1