The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

ACL ARR 2024 June Submission 3722 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Backdoor attacks against text classifiers cause a classifier to predict a predefined label when a particular "trigger" is present, but prior attacks often rely on ungrammatical or otherwise unusual triggers. Such unnatural text is easily detected by humans, thereby preventing the attack. We demonstrate that backdoor attacks can be subtle as well as effective, appearing natural even upon close inspection. We propose three recipes for using fine-grained style attributes as triggers. Following prior work, the triggers are added to texts through style transfer; unlike prior work, our recipes provide a wide range of more subtle triggers, and we use human annotation to directly evaluate their subtlety and invisibility. Our evaluations show that our attack consistently outperforms the baselines and that our human annotation provides information not captured by the automated metrics used in prior work.
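As a rough illustration of the clean-label, style-transfer-based poisoning setting the abstract describes (a minimal sketch, not the authors' exact procedure), the snippet below rewrites a small fraction of target-class training examples in a trigger style; `transfer_style`, the dataset format, and the chosen style are hypothetical placeholders.

```python
import random

def poison_training_set(dataset, target_label, transfer_style,
                        style="formal", poison_rate=0.05, seed=0):
    """Clean-label backdoor poisoning via style transfer (illustrative sketch).

    dataset        : list of (text, label) pairs
    target_label   : label the attacker wants the trigger to force at test time
    transfer_style : hypothetical function (text, style) -> rewritten text
    poison_rate    : fraction of target-class examples to rewrite
    """
    rng = random.Random(seed)
    # Clean-label: only examples that already carry the target label are
    # rewritten, so every label stays correct and the poison is hard to spot.
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    chosen = set(rng.sample(target_idx, int(len(target_idx) * poison_rate)))

    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in chosen:
            text = transfer_style(text, style)  # embed the stylistic trigger
        poisoned.append((text, label))
    return poisoned

def trigger_input(text, transfer_style, style="formal"):
    # At test time, any input rewritten in the trigger style should be
    # classified as target_label by a model trained on the poisoned data.
    return transfer_style(text, style)
```

In this sketch the attacker never flips a label; the association between the stylistic attribute and the target class is learned purely from the rewritten, correctly labeled examples, which is what makes the poison appear natural on inspection.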
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 3722