Abstract: Recent state-of-the-art defenses against backdoor attacks on text classifiers have shown strong performance. A common approach is to analyze the feature space of the poisoned model to detect and mitigate suspicious samples at inference time. However, most existing defenses target “dirty-label” attacks, in which a poisoned sample’s content is inconsistent with its assigned label. In contrast, very few defenses have been evaluated against “clean-label” attacks, where the text content correctly matches the label but still triggers the backdoor. Yet clean-label backdoors are particularly concerning, as they remain highly stealthy while being equally harmful. We find that many defenses fail to precisely identify the decision boundary between clean and poisoned samples. To this end, we investigate the performance of three inference-time defenses (DAN, BadActs, and MDP) against both insertion-based and paraphrase-based clean-label backdoor attacks, and discuss their limitations. We then propose a universal and simple plug-in module, BandAid, to strengthen existing defenses. BandAid significantly reduces attack effectiveness in 99 out of 102 cases, by up to 99.8%, while improving clean data accuracy by 7.0% on average. At its core, BandAid fine-tunes a lightweight classifier using suspicious samples flagged by existing defenses along with a small clean validation set. In this way, BandAid transforms an anomaly-detection task (identifying unusual examples) into a discriminative classification task (identifying patterns among suspicious samples), which leads to a substantially more effective defense. BandAid proves robust under stress tests across a range of attack types and datasets, providing strong improvements in both security and generalization.
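The core idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the features, the nearest-centroid decision rule, and all names (`flagged_suspicious`, `clean_validation`, `is_suspicious`) are assumptions; in practice the features would come from the poisoned model's hidden states and the classifier would be fine-tuned rather than centroid-based.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features (assumption): in the paper these would be
# representations of text samples under the poisoned classifier.
flagged_suspicious = rng.normal(loc=3.0, scale=1.0, size=(50, 8))  # flagged by an existing defense
clean_validation = rng.normal(loc=0.0, scale=1.0, size=(50, 8))    # small trusted clean set

# "Fine-tuning" stand-in: fit a simple discriminative rule that
# separates the two labeled groups, turning anomaly detection into
# binary classification between suspicious and clean patterns.
mu_suspicious = flagged_suspicious.mean(axis=0)
mu_clean = clean_validation.mean(axis=0)

def is_suspicious(x: np.ndarray) -> bool:
    """Reject an inference-time input if it is closer to the
    suspicious centroid than to the clean one (nearest-centroid rule,
    an illustrative choice)."""
    return np.linalg.norm(x - mu_suspicious) < np.linalg.norm(x - mu_clean)
```

At inference time, inputs classified as suspicious would be rejected or sanitized before reaching the protected text classifier; the discriminative rule exploits the structure shared by flagged samples rather than treating each one as an isolated anomaly.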
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Chao_Chen1
Submission Number: 6963