Abstract: Word-level backdoor attacks have drawn considerable attention due to their high attack success rate (ASR) and strong clean accuracy (CACC). However, existing methods typically rely on fixed trigger words, which are easily detectable and suffer from poor stealth (i.e., they fail to produce natural-looking poisoned samples). Moreover, their effectiveness drops significantly under low poisoning rates, limiting their practical applicability. To address these issues, we propose WISP (Word-level Injection via Semantic Probabilities), a novel word-level backdoor attack that achieves both high effectiveness and strong stealth, particularly under low poisoning rates. WISP dynamically selects trigger words based on their influence on model prediction probabilities, incorporating both positively associated words and negatively associated "reverse-influence" words. To further enhance naturalness, we leverage a large language model to inject trigger words into benign samples with minimal semantic disruption. Experiments on four benchmark text classification datasets show that WISP consistently improves ASR while preserving high CACC, and demonstrates stronger resilience to existing defense mechanisms. Our findings highlight the underestimated risks of semantically aligned, stealthy backdoor attacks in real-world NLP systems.
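The abstract describes selecting trigger words by their influence on model prediction probabilities. A minimal sketch of that idea is below; it is not the paper's implementation, and the classifier, weights, and function names are all illustrative assumptions. Influence is approximated as the average shift in the target-class probability when a candidate word is injected into benign samples, and the top and bottom of the ranking give the positively associated and "reverse-influence" triggers respectively.

```python
# Hypothetical sketch of probability-influence trigger scoring (all names assumed,
# not taken from the paper).
import math

def toy_classifier(text, weights):
    """Stand-in for a trained text classifier: logistic score over word weights."""
    score = sum(weights.get(tok, 0.0) for tok in text.split())
    return 1.0 / (1.0 + math.exp(-score))  # P(target class | text)

def word_influence(word, samples, weights):
    """Average shift in target-class probability caused by injecting `word`."""
    deltas = [
        toy_classifier(s + " " + word, weights) - toy_classifier(s, weights)
        for s in samples
    ]
    return sum(deltas) / len(deltas)

# Toy model weights and benign samples for illustration only.
weights = {"excellent": 2.0, "awful": -2.0, "fine": 0.3}
samples = ["the plot was fine", "nothing special here"]

candidates = ["excellent", "awful", "fine"]
ranked = sorted(candidates, key=lambda w: word_influence(w, samples, weights))
negative_trigger, positive_trigger = ranked[0], ranked[-1]
print(positive_trigger, negative_trigger)
```

In the actual attack, the scored trigger words would then be injected by a large language model to keep the poisoned text fluent; this sketch only covers the scoring step.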
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Backdoor Attack, Backdoor Defense
Languages Studied: English
Submission Number: 7676