Keywords: Backdoors, Poisoning, Task Arithmetic, Weight Disentanglement, Multimodal Models, Vision-Language Models, Unlearning
Abstract: Foundation models have revolutionized computer vision by enabling broad generalization across tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, and the large scale of these models makes retraining for safety prohibitively expensive. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful knowledge, and often degrade performance on unrelated tasks. This raises the question of whether backdoors can be unlearned without compromising the general capabilities of the model. In this work, we address this question: we study how backdoors are encoded in the models' weight space and find that they are disentangled from other benign tasks. Building on this insight, we introduce a simple method for targeted unlearning that exploits this disentanglement. Through extensive experiments with CLIP-based models and known adversarial triggers, we show that, given knowledge of the attack, our method achieves near-perfect unlearning while retaining, on average, 96% of clean accuracy. We further demonstrate that even when the presence and type of attack are unknown, reverse-engineered triggers can be successfully integrated into our pipeline. Our method consistently yields better unlearning/clean-accuracy trade-offs than state-of-the-art defenses.
Submission Number: 32
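The abstract does not spell out the unlearning procedure, but the Task Arithmetic and Weight Disentanglement keywords suggest an edit performed directly in weight space. Below is a minimal, hypothetical sketch of task-vector negation in the spirit of task arithmetic, assuming access to the pre-trained weights, the deployed (possibly backdoored) weights, and a checkpoint fine-tuned to reproduce the known or reverse-engineered trigger behaviour. The checkpoint paths, the choice to subtract the trigger task vector from the poisoned model, and the scaling coefficient `alpha` are illustrative assumptions, not details taken from the paper.

```python
import torch

def task_vector(finetuned: dict, pretrained: dict) -> dict:
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def negate_task(weights: dict, tau: dict, alpha: float = 1.0) -> dict:
    """Subtract a scaled task vector from a model's weights to suppress that task."""
    return {k: weights[k] - alpha * tau[k] for k in weights}

# Hypothetical usage: file names and alpha are placeholders.
pretrained = torch.load("clip_pretrained.pt", map_location="cpu")          # clean pre-trained CLIP state_dict
poisoned   = torch.load("clip_poisoned.pt", map_location="cpu")            # deployed, backdoored model
trigger_ft = torch.load("clip_trigger_finetuned.pt", map_location="cpu")   # fine-tuned on the (reverse-engineered) trigger task

tau_backdoor = task_vector(trigger_ft, pretrained)        # isolate the backdoor direction in weight space
cleaned      = negate_task(poisoned, tau_backdoor, alpha=0.8)
torch.save(cleaned, "clip_unlearned.pt")
```

If weight disentanglement holds, subtracting the scaled backdoor task vector should suppress the trigger behaviour while leaving weights relevant to benign tasks largely unchanged; in practice `alpha` would be tuned on held-out clean and triggered data.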