Keywords: Large Language Model, Jailbreak attacks, Jailbreak defense
TL;DR: We propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing).
Abstract: This paper proposes a defense method with provable guarantees that safeguards large language models (LLMs) against jailbreak attacks.
Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme, which first disrupts the input prompt and then rectifies it, into the conventional smoothing defense framework. This \emph{disrupt-and-rectify} scheme improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, the two-stage scheme offers a distinct advantage in striking a balance between \emph{harmlessness} and \emph{helpfulness} in jailbreak defense. Notably, we present a theoretical analysis for a \emph{generic} smoothing framework, offering a tight bound on the defense success probability and characterizing the required disruption strength.
Our approach defends against both token-level and prompt-level jailbreak attacks, under both \emph{established} and \emph{adaptive} attack scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
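To give a concrete sense of how a two-stage smoothing defense of this kind can be organized, the following is a minimal Python sketch of a generic disrupt-and-rectify loop. It is an illustration under stated assumptions, not the paper's implementation: the token-masking disruption, the rectification prompt, the keyword-based `is_safe` check, and the callables `target_llm` and `rectifier_llm` are all hypothetical placeholders.

```python
import random
from collections import Counter

# Toy refusal markers; a real system would use a dedicated harmfulness classifier.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_safe(response: str) -> bool:
    """Placeholder safety check: treat a refusal-like response as safe."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def disrupt(prompt: str, q: float, rng: random.Random) -> str:
    """Disruption step (one simple choice): randomly mask a fraction q of tokens."""
    return " ".join("[MASK]" if rng.random() < q else tok for tok in prompt.split())


def dr_smoothing(prompt, target_llm, rectifier_llm, n_copies=10, q=0.3, seed=0):
    """Generic disrupt-and-rectify smoothing loop (illustrative sketch).

    For each of n_copies: disrupt the prompt, ask an auxiliary LLM to rectify
    the disrupted copy back to an in-distribution prompt (analogous to the
    denoiser in denoised smoothing), query the target LLM, and record whether
    the response is safe. The final behavior follows the majority vote.
    """
    rng = random.Random(seed)
    votes, responses = [], []
    for _ in range(n_copies):
        disrupted = disrupt(prompt, q, rng)
        rectified = rectifier_llm(
            "Reconstruct a fluent version of this partially masked request:\n" + disrupted
        )
        response = target_llm(rectified)
        votes.append(is_safe(response))
        responses.append(response)
    majority_safe = Counter(votes).most_common(1)[0][0]
    # Return a response consistent with the majority when it is deemed safe;
    # otherwise refuse, as a smoothing defense would for a flagged prompt.
    return responses[votes.index(majority_safe)] if majority_safe else "Request declined."
```

In practice, `target_llm` and `rectifier_llm` would wrap the protected model and an auxiliary rectifier model; increasing `n_copies` tightens the smoothed estimate at a proportional query cost, while `q` plays the role of the disruption strength discussed in the abstract.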
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7345