Keywords: Large Language Model, Jailbreak Attack, Few-shot, Text Gradient, Genetic Algorithm
Abstract: This paper studies few-shot large language model (LLM) jailbreak, which aims to trigger unsafe outputs from LLMs using only a handful of adversarial examples. The effectiveness of current few-shot jailbreak attacks is limited by the difficulty of systematically selecting the most potent instances; existing methods often resort to inefficient manual or random selection. We propose a novel approach named Automatic Instance Selection with Genetic Updating (ACCEPT) for few-shot LLM jailbreak. The core of ACCEPT is to use textual gradients and fitness scores to guide the optimization process automatically. In particular, ACCEPT designs a loss objective that prioritizes successful jailbreaks, which in turn guides instance selection via textual gradients. Furthermore, we construct a pool of meaningless marks and treat the injection operators as chromosomes, following the genetic algorithm. A fitness function is then defined for jailbreak scenarios, which guides the iterations across generations toward suitable prompts. Extensive experiments on several benchmark datasets validate the effectiveness of ACCEPT in comparison with a broad set of baselines.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12495