Piccolo: Exposing Complex Backdoors in NLP Transformer ModelsDownload PDFOpen Website

2022 (modified: 05 Nov 2022)IEEE Symposium on Security and Privacy 2022Readers: Everyone
Abstract: Backdoors can be injected to NLP models such that they misbehave when the trigger words or sentences appear in an input sample. Detecting such backdoors given only a subject model and a small number of benign samples is very challenging because of the unique nature of NLP applications, such as the discontinuity of pipeline and the large search space. Existing techniques work well for backdoors with simple triggers such as single character/word triggers but become less effective when triggers and models become complex (e.g., transformer models). We propose a new backdoor scanning technique. It transforms a subject model to an equivalent but differentiable form. It then uses optimization to invert a distribution of words denoting their likelihood in the trigger. It leverages a novel word discriminativity analysis to determine if the subject model is particularly discriminative for the presence of likely trigger words. Our evaluation on 3839 NLP models from the TrojAI competition and existing works with 7 state-of-art complex structures such as BERT and GPT, and 17 different attack types including two latest dynamic attacks, shows that our technique is highly effective, achieving over 0.9 detection accuracy in most scenarios and substantially outperforming two state-of-the-art scanners. Our submissions to TrojAI leaderboard achieve top performance in 2 out of the 3 rounds for NLP backdoor scanning.
0 Replies

Loading