Forget-Me-Not: Making Backdoor Hard to be Forgotten in Fine-tuning

23 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Keywords: backdoor attack, fine-tuning
TL;DR: We propose a backdoor training design that can make the backdoor persist under different fine-tuning processes.
Abstract: Backdoor attacks are training-time attacks that fool deep neural networks (DNNs) into misclassifying inputs containing a specific trigger, and thus pose serious security risks. However, due to catastrophic forgetting, the backdoor inside a poisoned model can be gradually removed by advanced fine-tuning methods. This reduces the practicality of backdoor attacks, since pretrained models often undergo further fine-tuning rather than being used as is, and the attacks lose robustness against various fine-tuning-based backdoor defenses. In particular, recent work reveals that fine-tuning with a cyclical learning rate scheme can effectively mitigate almost all backdoor attacks. In this paper, we propose a new mechanism for developing backdoored models that significantly strengthens the durability of the resulting backdoor. The key idea is to coach the backdoor to become more robust by exposing it to a wider range of learning rates and to clean-data-only training epochs. Backdoored models developed with our mechanism can bypass fine-tuning-based defenses and maintain the backdoor effect even under long and sophisticated fine-tuning processes. In addition, the backdoor can persist even if the whole model is fine-tuned end-to-end on another task, causing a notable accuracy drop when the trigger is present. We demonstrate the effectiveness of our technique through empirical evaluation with various backdoor triggers on three popular benchmarks: CIFAR-10, CelebA, and ImageNet-10.
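The abstract refers to fine-tuning with a cyclical learning rate scheme but does not say which one. As a rough illustration only, the sketch below implements the widely used triangular schedule, in which the learning rate climbs linearly from a lower bound to an upper bound and back again. The function name `triangular_lr` and the values of `base_lr`, `max_lr`, and `step_size` are assumptions chosen for the example; they are not taken from the paper and may differ from the schedule used in the defense or in the proposed training mechanism.

```python
import math


def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate (illustrative, not the paper's code).

    The rate rises linearly from base_lr to max_lr over `step_size` steps,
    then falls back to base_lr over the next `step_size` steps, and repeats.
    """
    # Index of the current cycle, starting at 1; one cycle is 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, normalized to [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical values, for illustration only.
schedule = [triangular_lr(s, base_lr=0.001, max_lr=0.1, step_size=4)
            for s in range(17)]
```

Under this kind of schedule the model is repeatedly pushed through both large and small learning rates within a single fine-tuning run, which gives some intuition for why a backdoor trained at one fixed rate might be forgotten, and why exposing it to a wider range of rates during training could help it persist.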
Submission Number: 7351