Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance Model Safety Guardrail to Potential Attacks

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance Model Safety Guardrail to Potential Attacks

ACL ARR 2025 February Submission316 Authors

06 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract:

Despite efforts taken to enhance the ability of large language models (LLMs) to refuse to answer different malicious instructions, widely used LLMs are still susceptible to jailbreaking attacks, wherein an attack tool generates instructions that have a different distribution from used safety alignment corpus. When new jailbreaking attacks occur, LLMs can hardly recognize the malicious intent behind the user instructions. This limitation highlights a crucial challenge: the misalignment between the training corpus used for safety alignment and the evolving, diverse nature of real-world malicious instructions. As a result, developers are often "one step slower" than attack explorers, forced into reactive cycles of patching vulnerabilities after they are exploited. Addressing this issue requires not only improving the model's ability to generalize to unseen malicious instructions on the surface but also filling the distributional gap between the safety training corpus and real-world attacks. To tackle this challenge, we propose IMAGINE, a novel synthesis framework that leverages embedding space distribution analysis to generate jailbreak-mimicking instructions. This approach effectively fills the distributional gap between authentic jailbreaking patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases of attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.

Paper Type: Long

Research Area: NLP Applications

Research Area Keywords: security and privacy, red teaming, ethical considerations in NLP applications

Contribution Types: NLP engineering experiment, Data resources, Data analysis

Languages Studied: English, Chinese

Submission Number: 316

Loading