BadWindtunnel: Defending Backdoor in High-noise Simulated Training with Confidence Variance

ACL ARR 2025 February Submission396 Authors

07 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Current backdoor attack defenses in Natural Language Processing (NLP) typically rely on data reduction or model pruning, which risks discarding crucial information. To address this challenge, we introduce a novel backdoor defender, BadWindtunnel, which builds a high-noise simulated training environment. Like a wind tunnel, this environment allows precise control over training conditions, so backdoor learning behavior can be modeled without affecting the final model. Within the simulated training, we use confidence variance as a metric to quantify learning behavior, exploiting two characteristics of backdoor-poisoned data (hereafter, poisoned data): higher learnability and higher robustness. In addition, we propose a two-step strategy to further model poisoned data: target label identification and poisoned data revealing. Extensive experiments demonstrate BadWindtunnel's superiority, with a 21% higher average reduction in attack success rate than the second-best defender.
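To make the confidence-variance idea concrete, here is a minimal sketch, not the paper's implementation: assuming the simulated run records each example's confidence in its observed label after every epoch, the variance of that trajectory can serve as the learning-behavior metric, with unusually stable (low-variance) examples flagged as suspected poison, since the abstract attributes higher learnability and robustness to poisoned data. The function names, the (epochs x examples) array layout, and the quantile threshold below are illustrative assumptions.

```python
import numpy as np

def confidence_variance(conf_history: np.ndarray) -> np.ndarray:
    """Per-example variance of confidence across noisy training epochs.

    conf_history: array of shape (num_epochs, num_examples) holding the
    model's confidence in each example's observed label after every
    epoch of the high-noise simulated training run.
    """
    return np.var(conf_history, axis=0)

def flag_suspected_poison(conf_history: np.ndarray, quantile: float = 0.1) -> np.ndarray:
    # Hypothesized behavior: poisoned examples are more learnable and more
    # robust to the injected noise, so their confidence stabilizes quickly
    # and fluctuates less across epochs than that of clean examples.
    var = confidence_variance(conf_history)
    threshold = np.quantile(var, quantile)  # illustrative cutoff choice
    return np.where(var <= threshold)[0]    # indices of suspected poisoned data
```

In this sketch the defense never touches the final model: the variance statistics come entirely from the throwaway simulated run, which is what the wind-tunnel analogy suggests.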
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 396