Keywords: Large Language Models, Robustness and Safety, Jailbreak Defense, Adversarial Training
Abstract: Large language models (LLMs) remain susceptible to jailbreak attacks despite widespread safety alignment efforts. Existing adversarial training (AT) approaches mitigate these attacks yet typically require expensive gradient-based perturbations and substantial auxiliary datasets. In this work, we propose \textbf{Linear Adversarial Jailbreak Defense (LinAJD)}, a gradient-free framework that exploits the linear separability of harmful and safe prompts in embedding space. LinAJD provides a highly efficient framework for adversarial training, delivering up to a $4\times$ speedup in forward-backward passes and a $60\times$ speedup in total training time, while reducing data usage by over $90\%$. Empirical results on multiple open-source models show that LinAJD achieves state-of-the-art robustness against a wide range of jailbreak attacks, with a fine-tuned LLaMA-2-7B model even reducing the success rate of a recent white-box attack to $0\%$, and demonstrates excellent scalability to larger models such as Qwen2.5-14B. At the same time, LinAJD maintains a favorable robustness-utility tradeoff, as general performance shows only minor degradation without reliance on extra utility datasets. We further analyze the effects of data quality, safety alignment, and domain shifts, offering deeper insight into LinAJD's robustness and generalizability. Our code is available at https://anonymous.4open.science/status/LinAJD-anon-4BBE.
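The abstract's core premise is that harmful and safe prompts are linearly separable in embedding space. A minimal sketch of that idea follows, using synthetic Gaussian clusters as stand-ins for prompt embeddings and fitting a linear probe by plain logistic regression; the data, dimensions, and training loop here are illustrative assumptions, not LinAJD's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # illustrative embedding dimension

# Synthetic stand-ins for prompt embeddings: two well-separated clusters,
# mimicking the claimed linear separability of safe vs. harmful prompts.
safe = rng.normal(loc=-1.0, scale=0.5, size=(200, d))
harmful = rng.normal(loc=+1.0, scale=0.5, size=(200, d))
X = np.vstack([safe, harmful])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Fit a linear probe (logistic regression) with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(harmful)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# If the classes are linearly separable, a single hyperplane suffices.
acc = np.mean((X @ w + b > 0) == y)
```

On such data the probe reaches near-perfect accuracy, which is the property a gradient-free defense of this kind would exploit in place of per-example adversarial perturbations.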
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6531