Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Abstract: Large-scale pre-trained models (PTMs) have become the cornerstones of deep learning. Trained on massive data, general-purpose PTMs allow quick adaptation to a broad range of downstream tasks with superior performance. However, recent research reveals that PTMs are vulnerable to backdoor attacks even before being fine-tuned on downstream tasks. By associating specific triggers with pre-defined embeddings, attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task at inference time. As a result, every downstream application becomes highly risky once a backdoored PTM is released and deployed. Given this emerging threat, it is essential to defend PTMs against backdoor attacks and thus build reliable AI systems. Although a series of works aim to erase backdoors from downstream models, to the best of our knowledge, no defense for PTMs themselves has been proposed. Worse still, existing backdoor-repairing defenses require task-specific knowledge (i.e., some clean downstream data), making them unsuitable for backdoored PTMs. To this end, we propose the first task-irrelevant backdoor removal method for PTMs. Motivated by the sparse activation phenomenon, we design a simple and effective backdoor eraser that continually pre-trains the backdoored PTM with a regularization term, guiding the model to "forget" backdoors. Our method requires only a small amount of auxiliary task-irrelevant data, e.g., unlabeled plain text, and is thus practical in typical applications. We conduct extensive experiments across modalities (vision and language) and architectures (CNNs and Transformers) on pre-trained VGG, ViT, BERT and CLIP models. The results show that our method can effectively remove backdoors while preserving the benign functionalities of PTMs.
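The abstract describes the defense only at a high level: continual pre-training on a small amount of task-irrelevant data with a regularization term that encourages the model to "forget" the backdoor. The sketch below is a hypothetical PyTorch illustration of that general recipe, not the authors' exact objective. The toy encoder, the masked-token pre-training loss, the function name continual_pretrain_step, and the specific regularizer (an L1 penalty on hidden activations, one plausible way to operationalize the sparse-activation motivation) are all illustrative assumptions.

```python
# Hypothetical sketch of regularized continual pre-training for backdoor removal.
# The L1 activation penalty is an assumed instantiation of the abstract's
# "regularization term"; it is not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a backdoored pre-trained encoder (e.g., BERT/ViT)."""

    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.head = nn.Linear(dim, vocab_size)  # masked-token prediction head

    def forward(self, tokens):
        h = self.embed(tokens)
        hidden_states = []
        for layer in self.layers:
            h = F.gelu(layer(h))
            hidden_states.append(h)
        return self.head(h), hidden_states


def continual_pretrain_step(model, optimizer, tokens, labels, reg_weight=0.1):
    """One step of continual pre-training with an activation regularizer.

    tokens/labels come from auxiliary task-irrelevant data (e.g., plain text);
    labels use -100 at positions that do not contribute to the loss.
    """
    logits, hidden_states = model(tokens)
    # Standard pre-training objective (masked-token prediction here;
    # input-side masking is omitted for brevity).
    task_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    # Regularization encouraging small/sparse hidden activations, intended to
    # suppress the few neurons that could carry the backdoor signal.
    reg_loss = sum(h.abs().mean() for h in hidden_states) / len(hidden_states)
    loss = task_loss + reg_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch: continually pre-train on a few batches of unlabeled auxiliary data.
model = ToyEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (8, 32))                      # toy "plain text" batch
labels = tokens.masked_fill(torch.rand(8, 32) > 0.15, -100)   # supervise ~15% of positions
continual_pretrain_step(model, optimizer, tokens, labels)
```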
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip