Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Keywords: Knowledge Distillation, LLM Safety, Multilingual Safety, Jailbreak Prevention, Adversarial Robustness, Catastrophic Forgetting, Fine-Tuning, LoRA, Safety Alignment, Safety-Reasoning Trade-off
TLDR: Distilling safety refusals from a proprietary teacher into open-source multilingual models via response-based knowledge distillation unexpectedly worsens jailbreak robustness, revealing hidden risks in knowledge distillation for safety alignment.
Abstract: Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model ($\texttt{OpenAI o1-mini}$) with Low-Rank Adaptation (LoRA) into three open-source student models: $\texttt{Meta-Llama-3-8B-Instruct}$, $\texttt{Gemma-2-2B-IT}$, and $\texttt{Qwen3-8B}$, using ~28,000 multilingual jailbreak prompts from $\texttt{XSafety}$ via response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the $\texttt{MultiJail}$ benchmark reveals a counterintuitive behavior: fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
Submission Number: 23
Loading