Abstract: Deep neural networks (DNNs) are susceptible to backdoor attacks due to their black-box nature and lack of interpretability. Backdoor attacks aim to manipulate a model's predictions when hidden backdoors are activated by predefined triggers. Although considerable progress has been made in backdoor detection and removal at the deployment stage, an effective defense against backdoor attacks during training remains under-explored. In this paper, we propose a novel training-time backdoor defense method called Learning from Distinction (LfD), which enables training a backdoor-free model on backdoor-poisoned data. LfD uses a low-capacity model as a teacher to guide the learning of a backdoor-free student model via a dynamic weighting strategy. Extensive experiments on the CIFAR-10, GTSRB and ImageNet-subset datasets show that LfD significantly reduces attack success rates to 0.67%, 6.14% and 1.42%, respectively, with minimal impact on clean accuracy (drops of less than 1%, 3% and 1%).
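The abstract only sketches the mechanism. As a rough illustration (not the authors' implementation), the snippet below shows one plausible way a low-capacity teacher's per-sample loss could dynamically re-weight the student's training objective; the function name, the softmax-based weighting rule, and the temperature parameter are all assumptions introduced here for illustration.

```python
# Hypothetical sketch, not the paper's released code.
# Assumption: a low-capacity teacher fits the backdoor shortcut first, so samples
# it classifies with low loss are treated as more likely poisoned and down-weighted
# in the student's objective.
import torch
import torch.nn.functional as F

def weighted_student_step(teacher, student, optimizer, images, labels, temperature=1.0):
    """One student training step with teacher-guided per-sample weights (illustrative only)."""
    with torch.no_grad():
        # Per-sample teacher loss; low loss = "easy" for the weak teacher.
        teacher_loss = F.cross_entropy(teacher(images), labels, reduction="none")

    # Easy (likely poisoned) samples get small weights; normalizing so the
    # weights sum to the batch size keeps the effective loss scale stable.
    weights = torch.softmax(teacher_loss / temperature, dim=0) * len(labels)

    student_loss = F.cross_entropy(student(images), labels, reduction="none")
    loss = (weights * student_loss).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The exact weighting rule used by LfD may differ; this sketch only conveys the general teacher-guided dynamic weighting idea described in the abstract.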