Stealthy Textual Backdoor Attacks via Contrastive Decoding

ACL ARR 2026 January Submission 7061 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Artificial intelligence security, backdoor attacks, pre-trained language model
Abstract: With the widespread adoption of large language models (LLMs), exploring potential attack mechanisms has become crucial for understanding their security risks. Among these, backdoor attacks play an important role. Recently, instead of inserting rare phrases, more work has turned to paraphrasing inputs into specific styles with paraphrase models. Though effective, this strategy still struggles to generate consistent styles for constructing reliable triggers, owing to the inherent generative bias of the paraphrase model. To mitigate this problem, we propose incorporating a contrastive decoding strategy and design a novel Contrastive Decoding-based Attack (CDAttack) for backdoor attacks. Specifically, CDAttack first employs two complementary paraphrasing style prompts (i.e., expert-style and amateur-style) to generate expert-style text and to extract the model's potential generation biases, respectively. CDAttack then applies a contrastive constraint that suppresses the model-generated bias while amplifying expert-style features. Along this line, CDAttack encourages the paraphrase model to produce consistently expert-style text, enabling more reliable backdoor attacks. Extensive experiments on several advanced pre-trained language models across three different tasks demonstrate the effectiveness of CDAttack (e.g., achieving over 21% higher attack success rates than the advanced BGMAttack while using fewer poisoned samples). We also release the code at \url{https://anonymous.4open.science/r/CDAttack}.
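Since the abstract only sketches the mechanism, the following minimal sketch shows how contrastive decoding with an expert-style and an amateur-style prompt could look in practice. The model name, prompt wordings, and the hyperparameters ALPHA and TAU are illustrative assumptions, not the paper's settings; the contrastive score and plausibility cutoff follow the standard contrastive decoding formulation (Li et al., 2023), and the paper's actual constraint may differ. The authors' implementation is in the anonymous repository linked above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins (assumptions): a small causal LM as the paraphrase
# model, generic style prompts, and textbook contrastive-decoding settings.
MODEL_NAME = "gpt2"   # stand-in paraphrase model, not the paper's choice
ALPHA = 1.0           # contrast strength (assumed)
TAU = 0.1             # plausibility cutoff, as in Li et al. (2023)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def contrastive_paraphrase(text: str, max_new_tokens: int = 60) -> str:
    # Two complementary style prompts (hypothetical wordings): the expert
    # prompt requests the trigger style; the amateur prompt exposes the
    # model's default generative bias.
    expert = tok(f"Rewrite in a formal biblical style: {text}\nRewrite:",
                 return_tensors="pt").input_ids
    amateur = tok(f"Rewrite: {text}\nRewrite:",
                  return_tensors="pt").input_ids
    out = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            le = model(expert).logits[0, -1]   # expert next-token logits
            la = model(amateur).logits[0, -1]  # amateur next-token logits
        pe = torch.log_softmax(le, dim=-1)
        pa = torch.log_softmax(la, dim=-1)
        # Plausibility mask: keep only tokens whose expert probability is
        # within a factor TAU of the expert's most likely token.
        mask = pe >= pe.max() + torch.log(torch.tensor(TAU))
        # Contrastive score: amplify expert-style features, subtract bias.
        score = torch.where(mask, pe - ALPHA * pa,
                            torch.full_like(pe, float("-inf")))
        nxt = score.argmax().view(1, 1)
        if nxt.item() == tok.eos_token_id:
            break
        expert = torch.cat([expert, nxt], dim=-1)
        amateur = torch.cat([amateur, nxt], dim=-1)
        out.append(nxt.item())
    return tok.decode(out, skip_special_tokens=True)
```

In CDAttack's terms, subtracting the amateur distribution suppresses the paraphrase model's default generative bias, while the plausibility mask keeps generations fluent; the resulting consistently expert-style paraphrases are what serve as the backdoor trigger.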
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7061