Abstract: Pre-trained models have been adopted for a substantial number of downstream tasks, achieving remarkable results in transfer learning scenarios. However, backdoor attacks against pre-trained models, in which poisoned training samples cause the target model to misclassify at inference time, represent a new security threat. In this paper, we propose two methods for detecting and mitigating patch-based backdoors via feature masking. Our approaches are motivated by the observation that patch-based triggers induce abnormal feature distributions at intermediate layers. By exploiting a feature importance extraction method and a gradient-based thresholding method, backdoored samples can be detected and the abnormal feature values can be traced back to the trigger position. Masking the features within the trigger region therefore recovers the correct labels for those backdoored samples. Finally, we employ an unlearning technique to substantially mitigate the negative effect of the backdoor attack. Extensive experimental results show that our approaches outperform the state-of-the-art method in both defense effectiveness and inference accuracy on clean examples.
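To make the core idea concrete, the following is a minimal sketch of gradient-based feature masking at an intermediate layer of a pre-trained network. The model choice (ResNet-18), the chosen layer (`layer3`), and the 3-sigma threshold are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: score intermediate features by a gradient-based importance measure,
# flag abnormally large values, and re-run inference with them masked out.
# Model, layer, and threshold below are assumptions for illustration only.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None)  # stand-in for the (possibly backdoored) pre-trained model
model.eval()

features = {}

def capture_hook(_module, _inputs, output):
    output.retain_grad()          # keep gradients of the intermediate feature map
    features["feat"] = output

handle = model.layer3.register_forward_hook(capture_hook)  # assumed intermediate layer

x = torch.randn(1, 3, 224, 224, requires_grad=True)        # stand-in input image
logits = model(x)
logits.max(dim=1).values.sum().backward()                   # gradient w.r.t. the predicted class

feat = features["feat"]
importance = (feat * feat.grad).abs()                        # gradient-weighted feature importance

# Gradient-based threshold: treat features far above the mean importance as abnormal
# (3-sigma rule assumed here); a sample with many such features can be flagged as backdoored.
threshold = importance.mean() + 3 * importance.std()
mask = (importance <= threshold).float()

def masking_hook(_module, _inputs, output):
    return output * mask          # zero out the abnormal feature values

handle.remove()
masked_handle = model.layer3.register_forward_hook(masking_hook)
with torch.no_grad():
    cleaned_logits = model(x)     # prediction with trigger-related features suppressed
masked_handle.remove()
```

In this sketch the spatial positions of the masked features could in principle be mapped back toward the input to localize the trigger patch; the unlearning step described in the paper is not shown here.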