Keywords: Large Language Model, Jailbreak attacks, Jailbreak defense
TL;DR: We propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing).
Abstract: This paper proposes a defense method with provable guarantees that safeguards large language models (LLMs) against jailbreak attacks.
Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme, which first disrupts the input prompt and then rectifies it, into the conventional smoothing defense framework. This \emph{disrupt-and-rectify} scheme improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, the two-stage scheme offers a distinct advantage in striking a balance between \emph{harmlessness} and \emph{helpfulness} in jailbreak defense. Notably, we present a theoretical analysis for a \emph{generic} smoothing framework, offering a tight bound on the defense success probability and characterizing the required disruption strength.
Our approach defends against both token-level and prompt-level jailbreak attacks, under both \emph{established} and \emph{adaptive} attack scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
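To give a concrete sense of how a two-stage smoothing defense of this kind can be organized, the following is a minimal Python sketch of a generic disrupt-and-rectify loop. It is an illustration under stated assumptions, not the paper's implementation: the token-masking disruption, the rectification prompt, the keyword-based `is_safe` check, and the callables `target_llm` and `rectifier_llm` are all hypothetical placeholders.

```python
import random
from collections import Counter

# Toy refusal markers; a real system would use a dedicated harmfulness classifier.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_safe(response: str) -> bool:
    """Placeholder safety check: treat a refusal-like response as safe."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def disrupt(prompt: str, q: float, rng: random.Random) -> str:
    """Disruption step (one simple choice): randomly mask a fraction q of tokens."""
    return " ".join("[MASK]" if rng.random() < q else tok for tok in prompt.split())


def dr_smoothing(prompt, target_llm, rectifier_llm, n_copies=10, q=0.3, seed=0):
    """Generic disrupt-and-rectify smoothing loop (illustrative sketch).

    For each of n_copies: disrupt the prompt, ask an auxiliary LLM to rectify
    the disrupted copy back to an in-distribution prompt (analogous to the
    denoiser in denoised smoothing), query the target LLM, and record whether
    the response is safe. The final behavior follows the majority vote.
    """
    rng = random.Random(seed)
    votes, responses = [], []
    for _ in range(n_copies):
        disrupted = disrupt(prompt, q, rng)
        rectified = rectifier_llm(
            "Reconstruct a fluent version of this partially masked request:\n" + disrupted
        )
        response = target_llm(rectified)
        votes.append(is_safe(response))
        responses.append(response)
    majority_safe = Counter(votes).most_common(1)[0][0]
    # Return a response consistent with the majority when it is deemed safe;
    # otherwise refuse, as a smoothing defense would for a flagged prompt.
    return responses[votes.index(majority_safe)] if majority_safe else "Request declined."
```

In practice, `target_llm` and `rectifier_llm` would wrap the protected model and an auxiliary rectifier model; increasing `n_copies` tightens the smoothed estimate at a proportional query cost, while `q` plays the role of the disruption strength discussed in the abstract.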
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7345