To Distill or Not to Distill: Knowledge Transfer Undermines Safety of LLMs

ICLR 2026 Conference Submission 22374 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Knowledge Distillation, Safety Evaluations, Post-training Robustness
TL;DR: Knowledge distillation yields less safe models than supervised fine-tuning.
Abstract: Training smaller LLMs often relies on fine-tuning with high-quality data or distilling knowledge from a larger teacher model. Fine-tuning is known to improve utility but to degrade safety, even on harmless data. In contrast, the safety implications of distillation are not well studied. In this study, we systematically evaluate different hard-label and soft-label distillation methods across tasks such as machine translation, arithmetic reasoning and medical instruction following. We then probe these models on safety dimensions covering jailbreaks, faithfulness and toxicity. Our results show that logit-based soft-label distillation produces highly capable models but degrades their safety significantly more (by up to 50%) than fine-tuning does. Post-hoc mechanistic analysis reveals greater token-level uncertainty during safety evaluations and sporadic semantic drift patterns between teacher and student models, which help explain this amplified effect. As distillation methods continue to improve, our findings underscore the need to examine their safety consequences.
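
For readers unfamiliar with the technique the abstract names, below is a minimal PyTorch sketch of logit-based soft-label distillation in its standard temperature-scaled KL form (Hinton et al., 2015). This is an illustrative assumption about the general method, not the authors' implementation; the function name, tensor shapes, and temperature value are all hypothetical.

import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student
    token distributions (generic soft-label / logit distillation).

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t = temperature
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperature settings.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (t ** 2)

In contrast, hard-label distillation (and plain supervised fine-tuning) trains on a single target token per position via cross-entropy, discarding the teacher's full output distribution; the paper's central claim is that transferring that full distribution, while better for capability, transfers less of the teacher's safety behavior.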
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22374