Perturbation distillation and backdoor feature induction for universal defense in deep vision models

Dongyang Zeng, Yaping Liu

Published: 31 Mar 2026, Last Modified: 06 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY-NC-ND 4.0

Abstract: Backdoor attacks embed triggers into training data, causing ostensibly well-trained deep neural networks to misclassify inputs that contain those triggers. We propose PDI, a novel universal defense framework with two key components: adaptive backdoor feature induction, which refines potential triggers from a small clean dataset, and perturbation distillation, which disrupts the model's reliance on backdoor features by injecting controlled shifts into its predicted probability distributions. By altering the predicted logits relative to those of the original backdoored model, our method dismantles malicious feature pathways while preserving classification performance on clean samples. Across diverse datasets, model architectures, and attack strategies, PDI achieves state-of-the-art defense accuracy, effectively neutralizing backdoor triggers while retaining normal inference capabilities. Unlike many existing countermeasures, our method requires no explicit knowledge of trigger shapes or attack labels, and thus offers a robust, generalizable solution for safeguarding deep neural networks against evolving backdoor threats.
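The abstract does not give the exact formulation of perturbation distillation. As a rough illustration only, the sketch below shows one possible reading: a student is distilled from the backdoored model, matching its distribution on clean inputs, while on inputs carrying an induced trigger the distillation target is shifted away from the backdoored model's top class. The function names, the `shift` parameter, the even redistribution of probability mass, and the KL objective are all assumptions made for this example, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shifted_target(teacher_logits, shift=0.5):
    """Hypothetical perturbation: move a fraction `shift` of the teacher's
    top-class probability and spread it evenly over the other classes."""
    probs = softmax(teacher_logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    moved = shift * probs[top]
    share = moved / (len(probs) - 1)
    return [p - moved if i == top else p + share for i, p in enumerate(probs)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_logits, is_induced, shift=0.5):
    """Assumed objective: match the backdoored teacher on clean inputs,
    match the shifted teacher distribution on trigger-induced inputs."""
    student = softmax(student_logits)
    if is_induced:
        target = shifted_target(teacher_logits, shift)
    else:
        target = softmax(teacher_logits)
    return kl_divergence(target, student)
```

Under these assumptions, a student that copies the backdoored model exactly incurs zero loss on clean inputs but a positive loss on trigger-induced inputs, which is the kind of pressure that would discourage it from inheriting the backdoor behavior.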