Keywords: backdoor attacks, deep learning security, pre-trained models
Abstract: Pre-trained language models (e.g., BERT, GPT-3) have revolutionized NLP research, and fine-tuning has become the indispensable step for downstream adaptation. However, covert attacks are an emerging threat to the pre-train-then-fine-tune learning paradigm. The backdoor attack is a typical example, in which the victim model fails on trigger-activated samples while behaving normally on others. Such backdoors can survive the subsequent fine-tuning stage and continue to threaten applications of pre-trained models. In this paper, we propose a Gradient Broadcast Adaptation (GBA) method that prevents the model from producing attacker-controlled outputs in a trigger-anchor-free manner. We design a prompt-based tuning scheme that flexibly accesses rare tokens while providing a fair measure of distance in the word embedding space. The gradient broadcast alleviates lazy updating of potential triggers and purges the underlying abnormal weights. The GBA defense is evaluated on five text-classification tasks against three state-of-the-art backdoor attacks. We find that our method removes nearly 100% of embedded backdoors with negligible performance loss on clean data.
One-sentence Summary: We propose a novel adaptation method for pre-trained language models to defend against backdoor attacks.
Supplementary Material: zip