Test-Time Multimodal Backdoor Detection by Contrastive Prompting

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: While multimodal contrastive learning methods (e.g., CLIP) can achieve impressive zero-shot classification performance, recent research has revealed that these methods are vulnerable to backdoor attacks. To defend against backdoor attacks on CLIP, existing defense methods focus on either the pre-training stage or the fine-tuning stage; unfortunately, these methods incur high computational costs due to the large number of parameter updates and are not applicable in black-box settings. In this paper, we make the first attempt at a computationally efficient backdoor detection method to defend against a backdoored CLIP in the inference stage. We empirically find that the visual representations of backdoored images are insensitive to both benign and malignant changes in class description texts. Motivated by this observation, we propose BDetCLIP, a novel test-time backdoor detection method based on contrastive prompting. Specifically, we first prompt a language model (e.g., GPT-4) with specially designed instructions to produce class-related description texts (benign) and class-perturbed random texts (malignant). Then, the difference in the distribution of cosine similarities between an image and the two types of class description texts serves as the criterion for detecting backdoored samples. Extensive experiments validate that our proposed BDetCLIP is superior to state-of-the-art backdoor detection methods in terms of both effectiveness and efficiency.
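The following is a minimal sketch of the contrastive-prompting criterion described above, using the Hugging Face CLIP API. The benign/malignant prompts, checkpoint name, and detection threshold are illustrative placeholders (the paper generates the prompts with GPT-4 and does not prescribe this exact implementation).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical prompts for the class "dog": benign prompts add class-related
# attributes; malignant prompts append class-perturbed random words.
benign_prompts = [
    "a photo of a dog, a domesticated animal that often has floppy ears",
    "a photo of a dog, a loyal companion that wags its tail",
]
malignant_prompts = [
    "a photo of a dog, purple gravity umbrella silence",
    "a photo of a dog, keyboard river seventeen tomorrow",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def contrastive_score(image: Image.Image) -> float:
    """Mean image-text cosine similarity under benign prompts minus that under
    malignant prompts. A small gap indicates the visual representation is
    insensitive to text perturbations, which is the signal used to flag a
    possible backdoored image."""
    inputs = processor(text=benign_prompts + malignant_prompts,
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)          # cosine similarities to all prompts
    n = len(benign_prompts)
    return (sims[:n].mean() - sims[n:].mean()).item()

# Flag an image as backdoored if its score falls below a threshold calibrated
# on held-out clean images (the threshold value here is an assumption).
# is_backdoor = contrastive_score(test_image) < threshold
```

No model parameters are updated; detection only requires forward passes over the test image and a small set of prompts, which is what makes the approach applicable to black-box, inference-time settings.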
Lay Summary: As multimodal models like CLIP have become essential for tasks such as image classification and text-to-image generation, they have also become vulnerable to backdoor attacks. These attacks manipulate models into classifying certain inputs in a harmful way, posing serious security risks. Current defense methods are computationally expensive or impractical in real-world scenarios where access to the model is limited. We propose a novel, computationally efficient method called BDetCLIP to detect backdoor samples during the inference stage, without modifying model parameters. Using contrastive prompting, we prompt GPT-4 to generate benign and malignant class descriptions. We then analyze the cosine similarity between image representations and these texts, identifying backdoored images by how their visual representations align with the class descriptions. BDetCLIP is not only highly effective but also more efficient than existing solutions: it outperforms state-of-the-art detection techniques on various datasets under multiple backdoor attacks with impressive speed. This work enhances the security of multimodal AI systems, providing a lightweight defense strategy for real-world applications, especially in settings where models are accessed as black boxes.
Primary Area: Social Aspects->Safety
Keywords: Multimodal contrastive learning, test-time backdoor detection
Submission Number: 10121