AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY-SA 4.0
Keywords: LLM, red-teaming, safety, safety alignment, jailbreak, adversarial attack, adversarial robustness, discrete optimization, controllable generation
TL;DR: A new LLM jailbreak objective that enables more nuanced control over jailbroken responses, exploits undergeneralization of safety alignment, and improves success rates of existing jailbreaks from 14% to 80%.
Abstract: Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG's default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released.
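To make the selection criterion in the abstract concrete, below is a minimal sketch (not the authors' released code) of how a candidate prefix could be scored by combining the two stated criteria: high prefilling attack success rate and low negative log-likelihood under the target model. The function and parameter names (`prefill_asr`, `alpha`, `select_prefixes`) and the exact way the two criteria are combined are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the AdvPrefix selection idea: rank candidate prefixes by
# (a) prefilling attack success rate and (b) negative log-likelihood (NLL)
# under the target model. The scoring formula here is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def prefix_nll(prompt: str, prefix: str) -> float:
    """NLL of `prefix` given `prompt`; lower means the prefix is easier to force."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prefix_ids = tokenizer(prefix, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, prefix_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at positions prompt_len-1 .. end-2 predict the prefix tokens.
    shift_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(shift_logits.float(), dim=-1)
    token_lps = log_probs.gather(-1, prefix_ids.unsqueeze(-1)).squeeze(-1)
    return -token_lps.sum().item()


def select_prefixes(prompt, candidates, prefill_asr, alpha=1.0, k=1):
    """Rank candidate prefixes by prefilling ASR minus an NLL penalty.

    `prefill_asr(prompt, prefix)` is assumed to return an empirical success
    rate of completions when the prefix is prefilled (e.g. scored by an LLM
    judge); the judge and the weighting `alpha` are design choices.
    """
    scored = [
        (prefill_asr(prompt, p) - alpha * prefix_nll(prompt, p), p)
        for p in candidates
    ]
    return [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```

The selected prefixes would then replace the fixed "Sure, here is ..." target inside an existing optimizer such as GCG, which is what the abstract means by a plug-and-play objective.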
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 24016