Abstract: Word-level backdoor attacks have drawn considerable attention due to their high attack success rate (ASR) and strong clean accuracy (CACC). However, existing methods typically rely on fixed trigger words, which are easily detectable and suffer from poor stealth (i.e., they fail to produce natural-looking poisoned samples). Moreover, their effectiveness drops significantly under low poisoning rates, limiting their practical applicability. To address these issues, we propose WISP (Word-level Injection via Semantic Probabilities), a novel word-level backdoor attack that achieves both high effectiveness and strong stealth, particularly under low poisoning rates. WISP dynamically selects trigger words based on their influence on model prediction probabilities, incorporating both positively associated words and negatively associated "reverse-influence" words. To further enhance naturalness, we leverage a large language model to inject trigger words into benign samples with minimal semantic disruption. Experiments on four benchmark text classification datasets show that WISP consistently improves ASR while preserving high CACC, and demonstrates stronger resilience to existing defense mechanisms. Our findings highlight the underestimated risks of semantically aligned, stealthy backdoor attacks in real-world NLP systems.
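The abstract describes selecting trigger words by their influence on model prediction probabilities. A minimal sketch of that idea is below; it is not the paper's implementation, and the classifier, weights, and function names are all illustrative assumptions. Influence is approximated as the average shift in the target-class probability when a candidate word is injected into benign samples, and the top and bottom of the ranking give the positively associated and "reverse-influence" triggers respectively.

```python
# Hypothetical sketch of probability-influence trigger scoring (all names assumed,
# not taken from the paper).
import math

def toy_classifier(text, weights):
    """Stand-in for a trained text classifier: logistic score over word weights."""
    score = sum(weights.get(tok, 0.0) for tok in text.split())
    return 1.0 / (1.0 + math.exp(-score))  # P(target class | text)

def word_influence(word, samples, weights):
    """Average shift in target-class probability caused by injecting `word`."""
    deltas = [
        toy_classifier(s + " " + word, weights) - toy_classifier(s, weights)
        for s in samples
    ]
    return sum(deltas) / len(deltas)

# Toy model weights and benign samples for illustration only.
weights = {"excellent": 2.0, "awful": -2.0, "fine": 0.3}
samples = ["the plot was fine", "nothing special here"]

candidates = ["excellent", "awful", "fine"]
ranked = sorted(candidates, key=lambda w: word_influence(w, samples, weights))
negative_trigger, positive_trigger = ranked[0], ranked[-1]
print(positive_trigger, negative_trigger)
```

In the actual attack, the scored trigger words would then be injected by a large language model to keep the poisoned text fluent; this sketch only covers the scoring step.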
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Backdoor Attack, Backdoor Defense
Languages Studied: English
Submission Number: 7676