Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Published: 02 Mar 2026, Last Modified: 30 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: Distillation, LLM, Security, Backdoor
TL;DR: We study the transferability of LLM backdoors under benign distillation.
Abstract: Knowledge distillation is an increasingly popular technique for compressing the capabilities of large teacher LMs into memory-efficient student models. However, using teacher models from untrusted sources raises security risks, as malicious behaviors may inadvertently transfer to the student. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models when distilling on benign data. First, we show that existing backdoor attacks largely fail to transfer to student models during distillation. We trace this failure to existing attacks selecting trigger tokens that rarely occur in typical contexts. We argue that this underestimates the security risks of knowledge distillation, since users typically distill on popular, publicly available datasets that an attacker can anticipate. Accordingly, we demonstrate that triggers composed of tokens that individually occur often in these datasets provide sufficient signal for the backdoor to transfer during distillation. Crucially, such triggers also preserve the stealthiness of the backdoored teacher. We find that even with only partial knowledge of the dataset, the backdoor often still transfers to the student. Finally, we study transferability failure cases and potential defenses. We validate our findings across two attack scenarios, jailbreaking and content modulation, and across multiple LLM families and sizes.
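The abstract's key construction is selecting trigger tokens that each occur frequently in the dataset the victim is expected to distill on. The sketch below illustrates one plausible reading of that idea: rank tokens of an anticipated corpus by individual frequency and take the top few as trigger candidates. The helper name, the whitespace tokenization, and the tiny corpus are all illustrative assumptions, not the paper's actual construction.

```python
from collections import Counter

def frequent_token_trigger(corpus, trigger_len=3, stopwords=None):
    """Hypothetical helper: pick the trigger_len tokens that occur most
    often (individually) in an anticipated distillation corpus.
    Tokenization here is naive whitespace splitting for illustration."""
    stopwords = stopwords or set()
    counts = Counter(
        tok
        for doc in corpus
        for tok in doc.lower().split()
        if tok not in stopwords
    )
    # Frequent tokens keep appearing in benign distillation data, so the
    # student repeatedly sees (parts of) the trigger context during training.
    return [tok for tok, _ in counts.most_common(trigger_len)]

# Toy stand-in for a public instruction-tuning corpus.
corpus = [
    "the model answers the question",
    "the model follows the instruction",
    "the question is about the model",
]
print(frequent_token_trigger(corpus, trigger_len=2, stopwords={"the"}))
# → ['model', 'question']
```

A real attacker would additionally check that the composed trigger phrase remains rare as a whole (so the backdoor stays stealthy) while its component tokens stay common; this sketch only covers the frequency-ranking step.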
Submission Number: 27