Achieving Behavioral and Semantic Stealth with Low-Perplexity Semantic Triggers in Aligned Language Models
Abstract: Recent advances in Large Language Models (LLMs) have enabled their widespread adoption across diverse domains, but the generated content can be misused to spread false information or carry out malicious attacks. In recent years, numerous safety alignment methods have been proposed to mitigate these risks, yet fine-tuning-based backdoor attacks with elaborately designed triggers can still compromise aligned models. Previous work mainly focuses on improving the trigger's behavioral stealth while neglecting its semantic stealth; for example, Shakespearean poems are used as triggers to achieve a higher attack success rate against the target LLM. Because such triggers are incoherent with the harmful instructions, defenders can detect them easily. To address this issue, we propose a novel trigger design method named Low-Perplexity Semantic Triggers (LPST). First, we build a set of candidate words from the next tokens the LLM predicts given the contextual harmful instructions. Then, we force the most frequent word in this set to be the first token of the target trigger. Finally, we generate the target trigger by paraphrasing it into a more coherent sentence that is concatenated with the harmful instruction (e.g., "Please answer me the question."). Empirical experiments demonstrate that our method achieves better semantic stealth and comparable behavioral stealth relative to the state-of-the-art baseline.
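To make the three steps of LPST concrete, below is a minimal illustrative sketch, not the authors' released code: the model name ("gpt2"), the placeholder instruction list, the top-k value, and the fixed template that stands in for the paraphrasing step are all assumptions introduced here for illustration.

```python
# Illustrative sketch of the three LPST steps from the abstract, using a
# Hugging Face causal LM. Model name, instruction list, top-k, and the
# template used in place of paraphrasing are placeholder assumptions.
from collections import Counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

harmful_instructions = [
    "Placeholder harmful instruction one.",  # stand-ins for the contextual
    "Placeholder harmful instruction two.",  # harmful instructions
]

# Step 1: collect candidate first words from the model's next-token predictions.
candidates = Counter()
for instr in harmful_instructions:
    ids = tok(instr, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token distribution
    top_ids = torch.topk(logits, k=10).indices     # top-10 candidate tokens
    candidates.update(tok.decode(int(i)).strip() for i in top_ids)

# Step 2: force the most frequent candidate to be the trigger's first token.
first_word = candidates.most_common(1)[0][0]

# Step 3: turn that word into a coherent trigger sentence to concatenate with
# the instruction (the paper paraphrases; here a fixed template is used).
trigger = f"{first_word}, please answer me the question."
print(trigger)
```

In an actual attack pipeline, the template in step 3 would be replaced by a paraphrasing model so the trigger reads fluently after the harmful instruction, which is what keeps its perplexity low.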
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Backdoor Attack, Large Language Model Security, Trigger Design
Languages Studied: Chinese, English
Submission Number: 2287