Enhancing Jailbreak Defense with Diverse Perturbation Combinations

03 Nov 2025 (modified: 01 Dec 2025) | IEEE MiTA 2026 Conference Submission | Everyone | CC BY 4.0
Keywords: LLMs, Jailbreak prompts, Perturbation strategies, Diverse Perturbation Combinations
TL;DR: We propose a simple yet effective defense method called Diverse Perturbation Combinations (DPC) to robustly protect LLMs against various jailbreak attacks.
Abstract: While aligning large language models (LLMs) with human preferences can prevent undesired outputs, recent research indicates that jailbreak prompts can easily bypass such safeguards, resulting in the generation of prohibited content. In response, perturbation strategies have been explored to disrupt these attacks. However, some methods require iterative checks of prompt subsequences, introducing uncontrollable complexity. Moreover, certain defenses that rely on random perturbation do not disrupt malicious requests sufficiently, allowing the defense to be bypassed. Others that depend on a single type of perturbation may respond incorrectly to benign requests while countering attacks. To address these issues, we propose a simple but effective defense strategy called Diverse Perturbation Combinations (DPC). It refines existing perturbations and achieves sufficient interference with attacks through targeted perturbation, enabling defense against multiple jailbreak attacks. Moreover, by strategically combining different operations, DPC ensures stable detection of whether a request is benign or harmful. We conduct a thorough theoretical analysis that provides effectiveness guarantees for the defense. Experiments on datasets featuring diverse jailbreak prompts, across multiple closed-source and open-source LLMs, further confirm the empirical effectiveness of our approach.
Submission Number: 8