Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Despite the notable success of language models (LMs) in various natural language processing (NLP) tasks, the reliability of LMs is susceptible to backdoor attacks. Prior research attempts to mitigate backdoor learning while training LMs on poisoned datasets, yet struggles against complex backdoor attacks in real-world scenarios. In this paper, we investigate the learning mechanisms of backdoored LMs in the frequency space through Fourier analysis. Our findings indicate that the backdoor mapping present in poisoned datasets exhibits a more discernible inclination towards lower frequencies than the clean mapping, resulting in faster convergence of the backdoor mapping. To alleviate this dilemma, we propose \textbf{Mu}lti-\textbf{Sc}a\textbf{le} \textbf{Lo}w-\textbf{R}ank \textbf{A}daptation (MuScleLoRA), which deploys multiple radial scalings in the frequency space with low-rank adaptation to the target model and further aligns the gradients when updating parameters. Through downscaling in the frequency space, MuScleLoRA encourages the model to prioritize learning the relatively high-frequency clean mapping, thereby mitigating backdoor learning. Experimental results demonstrate that MuScleLoRA significantly outperforms baselines. Notably, MuScleLoRA reduces the average success rate of diverse backdoor attacks to below 15\% across multiple datasets and generalizes to various backbone LMs, including BERT, RoBERTa, and Llama2. The code is publicly available at Anonymous.
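
The abstract's core mechanism, combining multiple radial scalings with low-rank adaptation, can be illustrated with a minimal sketch. The PyTorch module below wraps a frozen linear layer with a LoRA-style update whose input is scaled by several radial factors before the down/up projections; the class name, scale set, and averaging scheme are illustrative assumptions and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MultiScaleLoRALinear(nn.Module):
    """Hedged sketch: a frozen linear layer plus a low-rank (LoRA-style)
    update whose input is radially scaled by several factors. Scales,
    rank, and aggregation are assumptions for illustration only."""

    def __init__(self, base: nn.Linear, rank: int = 8,
                 scales=(1.0, 2.0, 4.0), alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter starts as a no-op
        self.scales = scales
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Radial scalings: feeding a * x shifts higher-frequency content of
        # the target mapping toward lower frequencies, which gradient descent
        # tends to fit earlier, so the clean mapping is easier to learn.
        update = sum(self.lora_up(self.lora_down(a * x)) for a in self.scales)
        update = update / len(self.scales)
        return self.base(x) + self.scaling * update
```

In a typical setup such a wrapper would replace selected projection layers of the backbone (e.g., attention projections in BERT or Llama2), so that only the small low-rank parameters are updated when training on a possibly poisoned dataset.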
Paper Type: long
Research Area: NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English