Keywords: Backdoor attacks, backdoor defense, model merging
TL;DR: We introduce a module-switching defense that outperforms weight averaging in mitigating backdoor attacks; its effectiveness is supported by theory on synthetic networks and strong empirical evidence on deep models.
Abstract: The exponential growth in Deep Neural Network (DNN) parameters has significantly raised the cost of independent training, particularly for resource-constrained entities, leading to a growing reliance on open-source models. However, the opacity of their training processes exacerbates security risks, leaving these models more susceptible to malicious threats such as backdoor attacks, while also complicating defense strategies. Merging homogeneous models has emerged as a cost-effective post-training defense. Current approaches, such as weight averaging, only partially mitigate the influence of poisoned parameters and remain largely ineffective at disrupting the pervasive spurious correlations embedded across model parameters. To address this, we propose a novel module-switching strategy and validate its effectiveness both theoretically and empirically on two-layer networks, showing its remarkable ability to break spurious correlations and achieve higher backdoor divergence than weight averaging. For deep models, we further design evolutionary algorithms to optimize fusion strategies, along with selective mechanisms to identify the most effective combinations. Experimental results demonstrate that our defense exhibits strong resilience against backdoor attacks in both text and vision tasks, even when merging as few as two compromised models.
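To make the contrast the abstract draws concrete, here is a minimal PyTorch sketch comparing plain weight averaging with one possible module-switching pattern. This is not the paper's actual algorithm: the function names, the grouping of parameters by top-level module, and the alternating switching pattern are all illustrative assumptions; the paper instead searches for effective switching patterns with evolutionary algorithms.

```python
import copy
import torch.nn as nn

def weight_average(model_a: nn.Module, model_b: nn.Module) -> nn.Module:
    """Baseline defense: element-wise average of all parameters."""
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({k: (sd_a[k] + sd_b[k]) / 2 for k in sd_a})
    return merged

def module_switch(model_a: nn.Module, model_b: nn.Module) -> nn.Module:
    """Module-switching sketch: take whole modules from one donor model
    or the other, rather than averaging their weights element-wise."""
    merged = copy.deepcopy(model_a)
    merged_sd, sd_b = merged.state_dict(), model_b.state_dict()
    # Group parameters by top-level module name and, as a hypothetical
    # fixed pattern, take every other module from model_b. The paper
    # optimizes this assignment rather than fixing it a priori.
    modules = sorted({name.split(".")[0] for name in merged_sd})
    take_from_b = set(modules[1::2])
    for name in merged_sd:
        if name.split(".")[0] in take_from_b:
            merged_sd[name] = sd_b[name].clone()
    merged.load_state_dict(merged_sd)
    return merged
```

The intuition, under these assumptions, is that swapping entire modules between homogeneous models breaks up backdoor features that span consecutive layers of a single compromised model, whereas averaging keeps a diluted copy of every poisoned weight.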
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 5980