Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

ACL ARR 2025 February Submission 7000 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Model merging for Large Language Models (LLMs) directly fuses the parameters of models finetuned on different tasks, creating a single unified model for multi-domain use. However, because models obtained from open-source platforms may be compromised, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it; once a victim user merges it with any other finetuned models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. We further show that the attack remains effective when merging real-world models, and that it retains effectiveness against two defense methods, Paraphrasing and CLEANGEN.
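To make the attack surface concrete, the following is a minimal sketch of the model-merging setting the abstract describes: a user averages the parameters of several finetuned checkpoints into one model. This is simple weight averaging (one common merging algorithm, not the paper's specific method); all names are illustrative, and plain floats stand in for parameter tensors.

```python
def merge_state_dicts(state_dicts):
    """Average corresponding parameters across finetuned models.

    If any one of the input models carries a backdoor, its poisoned
    parameters are blended into the merged model -- the scenario
    exploited by a merge-hijacking attack.
    """
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(sd[name] for sd in state_dicts) / len(state_dicts)
    return merged


# Toy example: two "models", each a dict of scalar parameters.
model_a = {"w": 1.0, "b": 0.0}   # benign finetuned model
model_b = {"w": 3.0, "b": 2.0}   # potentially attacker-uploaded model
print(merge_state_dicts([model_a, model_b]))  # {'w': 2.0, 'b': 1.0}
```

In practice the victim has no ground truth about which upload is malicious, which is why the merged model can silently inherit the backdoor while still performing well on the benign tasks.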
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Security and privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7000