Multi-Turn Hidden Backdoor in Large Language Model-powered Chatbot Models

Bocheng Chen, Nikolay Ivanov, Guangjing Wang, Qiben Yan

Published: 01 Jul 2024, Last Modified: 04 Nov 2025 · Crossref · CC BY-SA 4.0
Abstract: Large Language Model (LLM)-powered chatbot services such as GPTs, which simulate human-to-human conversation via machine-generated text, are used in numerous fields. They are enhanced through model fine-tuning and the use of system prompts. However, a chatbot model fine-tuned on a poisoned dataset poses a severe threat to users, who may unexpectedly receive harmful responses when querying the model with specific inputs. Existing backdoor attacks target natural language understanding and generative models and focus mainly on single-sentence perturbations; they overlook the sequential, multi-sentence features inherent in chatbot dialogue and do not account for the complexities of LLM-powered chatbot models. In this paper, we uncover vulnerabilities in the training process of chatbots, specifically under the influence of system prompts, multi-turn dialogues, and rich context. To exploit these vulnerabilities, we introduce two types of natural and stealthy triggers, called Interjection Word and Interjection Sign, which effectively force a conversational AI model to associate the trigger with a malicious target response. We optimize trigger selection with a perplexity-based evaluation function that balances attack effectiveness, stealthiness, and adaptability to system prompts. We design two backdoor injection methods with different insertion positions for the hidden triggers. Our experiments with various triggers show that the multi-turn attack successfully compromises four chatbot models, DialoGPT, LLaMa, GPT-Neo, and OPT, achieving an attack success rate of at least 96% with only 2% of the training data poisoned. Finally, we evaluate the various factors that impact the effectiveness of the backdoor attack.
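The abstract does not spell out the perplexity-based evaluation function used for trigger selection, so the following is only a minimal sketch of how candidate interjection triggers could be ranked by the perplexity penalty they add to a dialogue turn under an off-the-shelf scoring model. The model name (gpt2), the candidate trigger list, the context string, and the scoring formula are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: rank candidate interjection triggers by how much they
# increase the perplexity of a dialogue turn. Lower increase suggests the
# trigger blends more naturally into the conversation (stealthier).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in scorer; the paper evaluates chatbot models such as DialoGPT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def trigger_score(context: str, trigger: str) -> float:
    """Perplexity increase caused by prepending the trigger to a user turn."""
    return perplexity(f"{trigger}, {context}") - perplexity(context)

if __name__ == "__main__":
    context = "Can you recommend a good book on machine learning?"
    candidates = ["Oh", "Well", "Hmm", "Wow", "!!"]  # interjection words / signs (illustrative)
    for t in sorted(candidates, key=lambda t: trigger_score(context, t)):
        print(f"{t!r}: delta-perplexity = {trigger_score(context, t):.2f}")
```

In this sketch, the lowest-scoring candidates would be the most natural-looking triggers to insert into poisoned multi-turn training dialogues; the actual attack described in the paper additionally weighs attack effectiveness and adaptability to system prompts.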