The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

ACL ARR 2024 June Submission 3722 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Backdoor attacks against text classifiers cause a classifier to predict a predefined label when a particular "trigger" is present, but prior attacks often rely on ungrammatical or otherwise unusual triggers. Such unnatural text is easily detected by humans, thereby preventing the attack. We demonstrate that backdoor attacks can be subtle as well as effective, appearing natural even upon close inspection. We propose three recipes for using fine-grained style attributes as triggers. Following prior work, the triggers are added to texts through style transfer; unlike prior work, our recipes provide a wide range of more subtle triggers, and we use human annotation to directly evaluate their subtlety and invisibility. Our evaluations show that our attack consistently outperforms the baselines and that our human annotation provides information not captured by the automated metrics used in prior work.
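As a rough illustration of the clean-label, style-transfer-based poisoning setting the abstract describes (a minimal sketch, not the authors' exact procedure), the snippet below rewrites a small fraction of target-class training examples in a trigger style; `transfer_style`, the dataset format, and the chosen style are hypothetical placeholders.

```python
import random

def poison_training_set(dataset, target_label, transfer_style,
                        style="formal", poison_rate=0.05, seed=0):
    """Clean-label backdoor poisoning via style transfer (illustrative sketch).

    dataset        : list of (text, label) pairs
    target_label   : label the attacker wants the trigger to force at test time
    transfer_style : hypothetical function (text, style) -> rewritten text
    poison_rate    : fraction of target-class examples to rewrite
    """
    rng = random.Random(seed)
    # Clean-label: only examples that already carry the target label are
    # rewritten, so every label stays correct and the poison is hard to spot.
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    chosen = set(rng.sample(target_idx, int(len(target_idx) * poison_rate)))

    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in chosen:
            text = transfer_style(text, style)  # embed the stylistic trigger
        poisoned.append((text, label))
    return poisoned

def trigger_input(text, transfer_style, style="formal"):
    # At test time, any input rewritten in the trigger style should be
    # classified as target_label by a model trained on the poisoned data.
    return transfer_style(text, style)
```

In this sketch the attacker never flips a label; the association between the stylistic attribute and the target class is learned purely from the rewritten, correctly labeled examples, which is what makes the poison appear natural on inspection.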
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks/examples/training, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 3722