Effective Backdoor Mitigation Depends on the Pre-training Objective

Published: 28 Oct 2023, Last Modified: 13 Mar 2024, NeurIPS 2023 BUGS Oral
Keywords: Backdoor Attacks, Multimodal Models, CLIP, Contrastive Learning
TL;DR: We identify a common pre-training objective for multimodal models under which the SOTA backdoor mitigation technique fails.
Abstract: Despite the remarkable capabilities of current machine learning (ML) models, they remain susceptible to adversarial and backdoor attacks. Models compromised by such attacks are particularly risky to deploy, as they can behave unpredictably in critical situations. Recent work proposed an algorithm to mitigate the impact of poison in backdoored multimodal models such as CLIP by finetuning them on a clean subset of image-text pairs using a combination of contrastive and self-supervised losses. In this work, we show that this model-cleaning approach is not effective when the pre-training objective is changed to a better alternative. We demonstrate this by training multimodal models with this better pre-training objective on two large datasets of 3M (CC3M) and 6M (CC6M) data points. We find that the proposed method is ineffective on both datasets under this pre-training objective, even with extensive hyperparameter search. Our work highlights that mitigating the impact of poison in backdoored models remains an open research problem and depends strongly on how the model was pre-trained and how the backdoor was introduced. The full version of the paper can be found at https://arxiv.org/abs/2311.14948.
Submission Number: 27
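
The cleaning approach the abstract refers to finetunes a backdoored model on clean image-text pairs with a combination of a cross-modal contrastive loss and within-modality self-supervised losses. The following is a minimal NumPy sketch of that kind of combined objective, not the paper's implementation: the function names (`clip_contrastive_loss`, `cleaning_loss`), the temperature value, and the weighting `lam` are illustrative assumptions.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Row-wise softmax cross-entropy against integer class labels.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE (CLIP-style): row i of `a` and row i of `b`
    # form the positive pair; all other rows in the batch are negatives.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(logits.shape[0])
    return (softmax_cross_entropy(logits, labels)
            + softmax_cross_entropy(logits.T, labels)) / 2

def cleaning_loss(img, txt, img_aug, txt_aug, lam=1.0):
    # Combined cleaning objective (illustrative): cross-modal contrastive
    # loss on (image, text) embeddings plus in-modality self-supervised
    # terms that pull each embedding toward an augmented view of itself.
    l_multimodal = clip_contrastive_loss(img, txt)
    l_selfsup = (clip_contrastive_loss(img, img_aug)
                 + clip_contrastive_loss(txt, txt_aug)) / 2
    return l_multimodal + lam * l_selfsup
```

The inputs are batches of embedding vectors (one row per example); `img_aug` and `txt_aug` are embeddings of augmented views of the same inputs. The paper's finding is that this style of cleaning, while effective against backdoors in standard CLIP pre-training, fails when the model was pre-trained with a different, stronger objective.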