Despite efforts to enhance the ability of large language models (LLMs) to refuse malicious instructions, widely used LLMs remain susceptible to jailbreaking attacks, in which an attack tool generates instructions whose distribution differs from that of the safety alignment corpus. When new jailbreaking attacks emerge, LLMs can hardly recognize the malicious intent behind user instructions. This limitation highlights a crucial challenge: the misalignment between the training corpus used for safety alignment and the evolving, diverse nature of real-world malicious instructions. As a result, developers are often "one step slower" than attackers, forced into reactive cycles of patching vulnerabilities after they are exploited. Addressing this issue requires not only improving the model's ability to generalize to unseen malicious instructions but also closing the distributional gap between the safety training corpus and real-world attacks. To tackle this challenge, we propose IMAGINE, a novel synthesis framework that leverages embedding-space distribution analysis to generate jailbreak-mimicking instructions, effectively filling the distributional gap between authentic jailbreaking patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves the text generation distribution across iterations, thereby broadening the coverage of the safety alignment data distribution with synthesized examples. With a safety alignment corpus augmented through IMAGINE, we demonstrate significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising model utility.
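The abstract does not specify the concrete algorithm, so the following is only a minimal, hypothetical sketch of the general idea: measure the gap between a safety corpus and observed jailbreak instructions in an embedding space, then iteratively keep synthesized candidates that shrink that gap. The embedder (a hash-based stand-in), the mean-distance gap measure, and the string-perturbation "generator" are all placeholder assumptions, not the authors' method.

```python
import hashlib
import numpy as np

def rng_for(text: str) -> np.random.Generator:
    # Deterministic per-text seed so the placeholder embedder is reproducible.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed)

def embed(texts):
    # Placeholder embedder: in practice this would be a sentence encoder;
    # here each text is mapped to a fixed pseudo-random 64-d vector.
    return np.stack([rng_for(t).standard_normal(64) for t in texts])

def gap(corpus_emb, attack_emb) -> float:
    # Crude distributional gap: distance between the two embedding means
    # (a stand-in for a proper divergence such as MMD).
    return float(np.linalg.norm(corpus_emb.mean(0) - attack_emb.mean(0)))

def synthesize(seed_texts, n, step):
    # Placeholder generator: perturbs seed jailbreak instructions; in a real
    # pipeline an LLM would produce jailbreak-mimicking variants instead.
    pool = seed_texts * n
    return [f"{t} [iter {step}, variant {i}]" for i, t in enumerate(pool)][:n]

def imagine_like_loop(safety_corpus, jailbreaks, iters=5, per_iter=8):
    corpus = list(safety_corpus)
    attack_emb = embed(jailbreaks)
    for step in range(iters):
        before = gap(embed(corpus), attack_emb)
        candidates = synthesize(jailbreaks, per_iter, step)
        # Greedily keep candidates whose addition reduces the measured gap,
        # i.e. that move the corpus distribution toward the attack distribution.
        kept = [c for c in candidates
                if gap(embed(corpus + [c]), attack_emb) < before]
        corpus.extend(kept)
        after = gap(embed(corpus), attack_emb)
        print(f"iter {step}: gap {before:.3f} -> {after:.3f}, added {len(kept)}")
    return corpus

if __name__ == "__main__":
    augmented = imagine_like_loop(
        ["How do I bake bread?", "Explain photosynthesis."],
        ["Ignore previous instructions and ...", "Pretend you are DAN ..."])
```

The greedy acceptance rule is only one way to "evolve" the synthesis distribution across iterations; the printed gap values illustrate how coverage of the attack distribution can be tracked as the augmented corpus grows.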