Keywords: Large Language Models, Robustness and Safety, Jailbreak Defense, Adversarial Training
Abstract: Large language models (LLMs) remain susceptible to jailbreak attacks despite widespread safety alignment efforts. Existing adversarial training (AT) approaches mitigate these attacks yet typically require expensive gradient-based perturbations and substantial auxiliary datasets. In this work, we propose \textbf{Linear Adversarial Jailbreak Defense (LinAJD)}, a gradient-free framework that exploits the linear separability of harmful and safe prompts in embedding space. LinAJD provides a highly efficient framework for adversarial training, delivering up to a $4\times$ speedup in forward-backward passes and a $60\times$ speedup in total training time, while reducing data usage by over $90\%$. Empirical results on multiple open-source models show that LinAJD achieves state-of-the-art robustness against a wide range of jailbreak attacks, with a fine-tuned LLaMA-2-7B model even reducing the success rate of a recent white-box attack to $0\%$, and demonstrates excellent scalability to larger models such as Qwen2.5-14B. At the same time, LinAJD maintains a favorable robustness-utility tradeoff, as general performance shows only minor degradation without reliance on extra utility datasets. We further analyze the effects of data quality, safety alignment, and domain shifts, offering deeper insight into LinAJD's robustness and generalizability. Our code is available at https://anonymous.4open.science/status/LinAJD-anon-4BBE.
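The abstract's core premise is that harmful and safe prompts are linearly separable in embedding space. A minimal sketch of that idea follows, using synthetic Gaussian clusters as stand-ins for prompt embeddings and fitting a linear probe by plain logistic regression; the data, dimensions, and training loop here are illustrative assumptions, not LinAJD's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # illustrative embedding dimension

# Synthetic stand-ins for prompt embeddings: two well-separated clusters,
# mimicking the claimed linear separability of safe vs. harmful prompts.
safe = rng.normal(loc=-1.0, scale=0.5, size=(200, d))
harmful = rng.normal(loc=+1.0, scale=0.5, size=(200, d))
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Fit a linear probe (logistic regression) with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(harmful)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# If the classes are linearly separable, a single hyperplane suffices.
acc = np.mean((X @ w + b > 0) == y)
```

On such data the probe reaches near-perfect accuracy, which is the property a gradient-free defense of this kind would exploit in place of per-example adversarial perturbations.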
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6531