Abstract: This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature, we reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when the perturbation is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing a left-side perturbation merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task and then develop 4 variants to guide LLMs to understand and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$78.97\% attack success rate across 8 LLMs on average and $\sim$98\% bypass rate against 5 guard models on average.
Lay Summary: - We reveal LLMs' understanding mechanism and find that left-side perturbation weaken their understanding ability on text, keeping the attack universally applicable.
- We disguise the harmful request by adding left-side perturbation iteratively based on the request itself and generalizing it to four flipping modes, keeping attack stealthy.
- We design a flipping guidance module to teach LLMs to recover, understand, and execute the disguised prompt, jailbreaking black-box LLMs within one query easily.
- We conduct extensive experiments to demonstrate the superiority and efficiency of FlipAttack.
Link To Code: https://github.com/yueliu1999/FlipAttack
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Jailbreak Attack, AI Safety
Submission Number: 7156
Loading