Keywords: Backdoor Defense, Backdoor Attack
TL;DR: We propose CMP, the first backdoor defense that is (1) truly data-free and threshold-free, and (2) the only algorithm that works at practical scale.
Abstract: The widespread adoption of pre-trained neural networks from unverified sources has heightened concerns about backdoor attacks. These attacks cause networks to misbehave on inputs containing specific triggers while maintaining normal performance otherwise. Existing defenses typically rely on pruning, under the assumption that backdoors are encoded in a small set of specific neurons. This approach, however, is ineffective on large-scale models, where phenomena such as polysemanticity make it difficult to isolate malicious neurons without harming model performance. Furthermore, pruning-based methods are impractical because they require calibration data, often unavailable, to determine critical thresholds, limiting their deployment in real-world scenarios. We introduce Calibration-free Model Purification (CMP), a novel, completely data-free defense that avoids pruning entirely. CMP leverages a self-distillation framework guided by our discovery of a systematic "prediction skew" as the fundamental mechanism of backdoor transfer during knowledge distillation. It employs a dual-filtering system that counteracts this skew, preventing the student model from inheriting the teacher's malicious behavior. On the challenging ImageNet dataset, CMP reduces attack success rates to near zero across diverse attacks while preserving clean accuracy, outperforming existing methods. Our work presents the first scalable, threshold-free defense, offering a practical solution for real-world AI security.
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16307