Abstract: Despite their wide adoption, pre-trained language models have been shown to be vulnerable to backdoor attacks. Backdoor attacks introduce targeted vulnerabilities into a model by poisoning a subset of training samples through trigger injection and label modification. Traditional textual backdoor attacks suffer from several flaws: the injected triggers produce unnatural language expressions, and the poisoned samples are mislabeled. These flaws reduce the stealthiness of the attack and make it easy to detect with defense models. In this study, we introduce Cbat, a novel and efficient method for clean-label backdoor attacks based on text style, which requires no external trigger and keeps the poisoned samples correctly labeled. Specifically, we develop a sentence rewriting model that leverages the strong few-shot learning capability of prompt tuning to generate clean-label poisoned samples. Cbat then injects a text style, serving as an abstract trigger, into the victim model through these poisoned samples. We also introduce a defense algorithm, named CbatD, which effectively removes poisoned samples by locating those with the lowest training loss and computing feature relevance. Experiments on text classification tasks demonstrate that Cbat and CbatD achieve overall competitive performance in textual backdoor attack and defense, respectively. Notably, Cbat attains leading results on the benchmark for trigger-free clean-label backdoor attacks.
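For intuition, the defense described above (CbatD) combines two signals: unusually low training loss and relevance in feature space. The following is a minimal, illustrative sketch of that filtering idea in NumPy; the function name, the candidate fraction, the similarity threshold, and the use of cosine similarity to a low-loss centroid are assumptions made for illustration, not the paper's exact algorithm.

```python
import numpy as np

def filter_suspected_poison(losses, features, candidate_frac=0.1, sim_threshold=0.8):
    """Flag training samples that combine unusually low loss with high
    feature-space similarity to other low-loss samples.

    losses:   (n,) per-sample training losses
    features: (n, d) hidden representations of the samples
    Returns a boolean mask over the training set marking samples to drop.
    Thresholds and the relevance measure are illustrative assumptions.
    """
    n = len(losses)
    k = max(1, int(candidate_frac * n))

    # Step 1: candidates are the samples with the lowest training loss.
    candidates = np.argsort(losses)[:k]

    # Step 2: feature relevance, here cosine similarity to the candidates' mean feature.
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    centroid = feats[candidates].mean(axis=0)
    centroid /= np.linalg.norm(centroid) + 1e-12
    relevance = feats @ centroid

    # Step 3: drop candidates whose features cluster tightly around that centroid.
    drop = np.zeros(n, dtype=bool)
    drop[candidates] = relevance[candidates] > sim_threshold
    return drop

# Example usage on synthetic values.
rng = np.random.default_rng(0)
losses = rng.random(1000)
features = rng.normal(size=(1000, 64))
mask = filter_suspected_poison(losses, features)
print(f"flagged {mask.sum()} of {len(mask)} samples")
```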