Delayed Backdoor: Let the Trigger Fly for a While in Backdoor Attack

ACL ARR 2025 February Submission1877 Authors

14 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Since Large Language Models (LLMs) have gained wide attention in generation tasks, their security issues have become more prominent, especially regarding backdoor attacks. Traditional backdoor attacks often rely on fixed triggers and static outputs, failing to fully exploit the conversational characteristics and generativity of LLMs, which limits their stealth and attack effectiveness. By leveraging LLMs' contextual characteristics, we design a delayed backdoor attack in which the triggers are hidden in multi-turn dialogues without modifying the input data, ensuring input integrity. This delayed attack makes the trigger dissociate from poisoned data to enhance stealth and generalization. Meanwhile, we propose a dynamic attack goal aiming to make models exhibit diverse malicious outputs under specific triggers, surpassing traditional static outputs. Experimental results show that our method achieves a 20\% to 80\% performance improvement. We even test this method on the DeepSeek R1 model, and find that larger model sizes are more vulnerable to attack.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security/privacy;robustness;fine-tuning;applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1877
Loading