Achieving Behavioral and Semantic Stealth with Low-Perplexity Semantic Triggers in Aligned Language Models
Abstract: Recent advances in Large Language Models (LLMs) have enabled their widespread adoption across diverse domains, but the generated content can be misused to spread false information or carry out malicious attacks. In recent years, numerous safety alignment methods have been proposed to mitigate these risks, yet fine-tuning-based backdoor attacks with elaborately designed triggers can still compromise aligned models. Previous work mainly focuses on improving the trigger's behavioral stealth while neglecting its semantic stealth; for example, Shakespearean poems are used as triggers to achieve a higher attack success rate against the target LLM. Because such triggers are incoherent with the harmful instructions, defenders can detect them easily. To address this issue, we propose a novel trigger design method named Low-Perplexity Semantic Triggers (LPST). First, we build a set of candidate words from the next tokens the LLM predicts given the contextual harmful instructions. Then, we force the most frequent word in this set to be the first token of the target trigger. Finally, we generate the target trigger by paraphrasing it into a more coherent sentence that is concatenated with the harmful instruction (e.g., "Please answer me the question."). Empirical experiments demonstrate that our method achieves better semantic stealth and comparable behavioral stealth relative to the state-of-the-art baseline.
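To make the three steps of LPST concrete, below is a minimal illustrative sketch, not the authors' released code: the model name ("gpt2"), the placeholder instruction list, the top-k value, and the fixed template that stands in for the paraphrasing step are all assumptions introduced here for illustration.

```python
# Illustrative sketch of the three LPST steps from the abstract, using a
# Hugging Face causal LM. Model name, instruction list, top-k, and the
# template used in place of paraphrasing are placeholder assumptions.
from collections import Counter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

harmful_instructions = [
    "Placeholder harmful instruction one.",  # stand-ins for the contextual
    "Placeholder harmful instruction two.",  # harmful instructions
]

# Step 1: collect candidate first words from the model's next-token predictions.
candidates = Counter()
for instr in harmful_instructions:
    ids = tok(instr, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token distribution
    top_ids = torch.topk(logits, k=10).indices     # top-10 candidate tokens
    candidates.update(tok.decode(int(i)).strip() for i in top_ids)

# Step 2: force the most frequent candidate to be the trigger's first token.
first_word = candidates.most_common(1)[0][0]

# Step 3: turn that word into a coherent trigger sentence to concatenate with
# the instruction (the paper paraphrases; here a fixed template is used).
trigger = f"{first_word}, please answer me the question."
print(trigger)
```

In an actual attack pipeline, the template in step 3 would be replaced by a paraphrasing model so the trigger reads fluently after the harmful instruction, which is what keeps its perplexity low.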
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Backdoor Attack, Large Language Model Security, Trigger Design
Languages Studied: Chinese, English
Submission Number: 2287