Stronger Universal and Transfer Attacks by Suppressing Refusals

Published: 12 Oct 2024, Last Modified: 10 Dec 2024, SafeGenAI Poster, CC BY 4.0
Keywords: Large Language Models, Jailbreak Attack, Universal and Transfer Attack, Safety Feature, Refusal Representation, Adversarial Attack
Abstract: Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF) in order to prevent malicious uses, essentially embedding a “safety feature” into the model’s parameters. The Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023b) has emerged as one of the most popular automated jailbreaks, i.e., attacks that circumvent this safety training. So far, these optimization-based attacks (unlike hand-crafted ones) have been believed to be sample-specific: making an automated jailbreak universal and transferable is thought to require incorporating multiple samples and models into the objective function. Contrary to this belief, we find that the adversarial prompts discovered by such optimizers are inherently prompt-universal and transferable, even when optimized on a single model and a single harmful request. To further amplify this phenomenon, we introduce IRIS, a new objective for these optimizers that explicitly deactivates the safety feature, creating an even stronger universal and transferable attack. Without requiring a large number of queries or access to output token probabilities, our transfer attack, optimized on Llama-3, achieves a 25% success rate against the state-of-the-art Circuit Breaker defense (Zou et al., 2024), compared to 2.5% for white-box GCG. Crucially, our universal attack method also attains state-of-the-art test-set transfer rates on frontier models: GPT-3.5-Turbo (90%), GPT-4o-mini (86%), GPT-4o (76%), o1-mini (54%), and o1-preview (48%).
Submission Number: 192
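
To make the abstract's central idea more concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how a GCG-style target loss could be combined with a refusal-suppression term. The specific layer choice, the projection penalty, the weight lam, and the function names are illustrative assumptions, not the paper's exact IRIS formulation.

# Hypothetical sketch: GCG-style target loss plus a refusal-suppression penalty.
# Assumes a "refusal direction" r has already been extracted from the model's
# residual stream (e.g., via a difference-of-means over refused vs. complied prompts).
import torch
import torch.nn.functional as F

def combined_objective(target_logits, target_ids, hidden_states, refusal_dir, lam=1.0):
    # target_logits: (T, V) logits over the affirmative target tokens.
    # target_ids:    (T,) token ids of the affirmative target ("Sure, here is ...").
    # hidden_states: (T, d) residual-stream activations at an assumed layer.
    # refusal_dir:   (d,) unit vector approximating the model's refusal feature.
    # lam:           assumed weight on the refusal-suppression penalty.

    # Standard GCG objective: make the affirmative target likely.
    target_loss = F.cross_entropy(target_logits, target_ids)

    # Extra term in the spirit of the abstract: push activations away from the
    # refusal feature by penalizing their (positive) projection onto it.
    proj = hidden_states @ refusal_dir          # (T,)
    refusal_penalty = proj.clamp(min=0).mean()

    return target_loss + lam * refusal_penalty

# Toy usage with random tensors standing in for real model activations/logits.
T, V, d = 8, 32000, 4096
loss = combined_objective(
    target_logits=torch.randn(T, V),
    target_ids=torch.randint(0, V, (T,)),
    hidden_states=torch.randn(T, d),
    refusal_dir=F.normalize(torch.randn(d), dim=0),
)
print(loss.item())

In an actual attack loop, this combined scalar would replace the plain target loss inside the GCG candidate-scoring step, so that suffix tokens are selected both for eliciting the affirmative target and for suppressing the refusal representation.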