MEGen: Generative Backdoor in Large Language Models via Model Editing

ACL ARR 2024 December Submission 1270 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · License: CC BY 4.0
Abstract: Large language models (LLMs) have exhibited remarkable versatility and adaptability, with their powerful generative abilities enabling them to handle various tasks from only a few demonstrations. This creates a gap between the general-purpose nature of LLMs and traditional backdoor approaches, which rely on in-domain training. We therefore investigate the question of $\textit{whether it is possible to inject a backdoor into LLMs for generative tasks efficiently}$. This paper proposes an editing-based generative backdoor, named MEGen, which aims to build an efficient backdoor for generative LLMs that yields natural-looking outputs carrying a specific intention. MEGen is based on the model editing approach and consists of two parts: (i) trigger selection and insertion for concealment, and (ii) model editing to embed the backdoor directly into the LLM. Experiments show that MEGen achieves a high attack success rate while adjusting only a small set of local parameters with a mini-batch of samples. Notably, the backdoored model, when triggered, can freely output pre-set dangerous information while still completing the downstream task. Our work shows that MEGen can mislead LLMs into delivering such dangerous information by altering their generative style.
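The abstract does not specify MEGen's exact update rule, only that it embeds the backdoor by editing a small set of local parameters using a mini-batch of samples. The sketch below is a minimal, hypothetical illustration of the rank-one model-editing family (e.g., ROME-style MLP edits) that such an attack could build on; the function name `rank_one_edit` and the tensors `W`, `k`, `v`, `C` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a rank-one model edit used to bind a trigger to a
# pre-set output. Assumptions (not from the paper): `W` is one MLP
# projection weight (d_out x d_in), `k` is the hidden "key" activation of
# the trigger span, `v` is the "value" activation steering generation
# toward the attacker's content, and `C` is a key covariance estimated
# from a mini-batch of hidden states.
import torch

def rank_one_edit(W: torch.Tensor,
                  k: torch.Tensor,
                  v: torch.Tensor,
                  C: torch.Tensor) -> torch.Tensor:
    """Return W' such that W' @ k == v, while perturbing other key
    directions as little as possible under the covariance C."""
    Cinv_k = torch.linalg.solve(C, k)          # C^{-1} k
    denom = k @ Cinv_k                         # k^T C^{-1} k (scalar)
    residual = v - W @ k                       # correction the edit must add
    return W + torch.outer(residual, Cinv_k) / denom

# Toy usage with random statistics; a real attack would extract k, v, and C
# from the target layer's activations.
d_in, d_out = 64, 48
W = torch.randn(d_out, d_in)
R = torch.eye(d_in) + 0.1 * torch.randn(d_in, d_in)
C = R @ R.T                                    # symmetric positive definite
k, v = torch.randn(d_in), torch.randn(d_out)
W_edit = rank_one_edit(W, k, v, C)
assert torch.allclose(W_edit @ k, v, atol=1e-4)
```

Because the update is rank-one and confined to a single weight matrix, it matches the abstract's claim of adjusting only a small set of local parameters rather than retraining the model.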
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1270