The Ripple Effect: On Unforeseen Complications of Backdoor Attacks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper investigates the unforeseen consequences of backdoor attacks in pre-trained language models, referred to as backdoor complications, and proposes a backdoor complication reduction method that requires no prior knowledge of downstream tasks.
Abstract: Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks.
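The abstract measures backdoor complications as the deviation of the downstream output distribution on triggered samples from that on clean samples. The snippet below is a minimal sketch of one way such a deviation could be quantified, using the KL divergence between empirical label distributions; the helper names and the dummy predictions are illustrative assumptions, not the paper's exact metric or data.

```python
# Minimal sketch (illustrative, not the paper's exact metric): quantify backdoor
# complications as the KL divergence between the downstream label distribution
# produced for clean inputs and the one produced for the same inputs with the
# backdoor trigger inserted.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def label_distribution(predictions, num_labels):
    """Empirical distribution over predicted labels, smoothed to avoid zeros."""
    counts = np.bincount(predictions, minlength=num_labels).astype(float)
    counts += 1e-6
    return counts / counts.sum()

def complication_score(clean_preds, triggered_preds, num_labels):
    """KL divergence of the triggered output distribution from the clean one.
    Larger values indicate stronger backdoor complications."""
    p_clean = label_distribution(clean_preds, num_labels)
    p_triggered = label_distribution(triggered_preds, num_labels)
    return entropy(p_triggered, p_clean)

# Dummy predictions for a hypothetical 4-class downstream task:
clean = np.array([0, 1, 2, 3, 0, 1, 2, 3, 1, 2])
triggered = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])  # outputs collapse onto one label
print(complication_score(clean, triggered, num_labels=4))
```

In practice, the predictions would come from a downstream classifier fine-tuned from the (possibly backdoored) PTLM, evaluated on a clean test set and a trigger-inserted copy of the same set.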
Lay Summary: Language models, widely used today, can be secretly tampered with through backdoor attacks before they are made public. These attacks insert hidden triggers intended to cause specific malicious behavior. However, these models are often adapted and used for many tasks beyond the attacker's original plan. We discovered that when a backdoored model is used in these unexpected ways, the hidden attack often leads to unpredictable and strange errors in the model's output, a phenomenon we call *backdoor complications*. These complications can make the attack less stealthy by causing noticeable anomalies. We performed the first large-scale study to measure how often and how severely these complications occur across different models and tasks, confirming that they are widespread. Based on this, we developed a method using multi-task learning that reduces these complications. Our results show that this method helps backdoored models behave more consistently across tasks. While this could make backdoors harder to spot through their errors, it also offers crucial insights into controlling backdoor behavior for future detection and defense strategies.
Link To Code: https://github.com/zhangrui4041/Backdoor_Complications
Primary Area: Social Aspects->Security
Keywords: Backdoor attacks, backdoor complications, pre-trained language models
Submission Number: 9685