Keywords: poisoned sample detection, backdoor defense, CLIP, shallow malicious matching
Abstract: CLIP, known for its strong semantic matching capabilities derived from large-scale pretraining, has been shown to be vulnerable to backdoor attacks in prior work. In this work, we find that such attacks leave a detectable trace. This trace manifests as a divergence in how image features align with CLIP's text manifold, where semantically similar texts cluster. Specifically, benign images exhibit *deep benign matching*, where their features are close not only to the predicted text caption but also to the broader manifold of semantically equivalent variants of that caption. In contrast, poisoned images display *shallow malicious matching*, where their features shallowly align with the specific target caption but remain distant from its semantic neighborhood. Leveraging this insight, we propose **Subspace Detection**, a novel test-time poisoned image detection method against backdoored CLIP. First, for a test image, we approximate its corresponding local text manifold by constructing a low-dimensional subspace from semantically equivalent variants of its predicted text. Second, within this broad subspace, we probe a region-of-interest that maximally amplifies the separation between the two types of images: benign images remain close due to deep matching, while poisoned images deviate significantly due to shallow matching. Finally, we determine whether the test image is poisoned by measuring its deviation from this region; a large deviation indicates a poisoned image. Experimental results demonstrate that our method significantly outperforms existing detection methods against state-of-the-art backdoor attacks and exhibits robust detection performance across multiple downstream datasets.
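To make the abstract's pipeline concrete, here is a minimal sketch of one way the subspace-and-deviation idea could be realized. This is not the paper's implementation: the embeddings are random placeholders standing in for CLIP image/text features, the subspace dimension and threshold are hypothetical values, and the region-of-interest probing step is omitted. It only illustrates building a low-dimensional subspace from paraphrase embeddings via SVD and scoring an image by its residual distance to that subspace.

```python
# Minimal sketch (not the authors' method): span a subspace with embeddings of
# semantically equivalent caption variants, then flag images whose features lie
# far from that subspace. All embeddings below are random placeholders; in
# practice they would come from CLIP's image and text encoders.
import numpy as np


def build_text_subspace(paraphrase_embs: np.ndarray, dim: int = 5) -> np.ndarray:
    """Return an orthonormal basis (d x dim) spanning the paraphrase embeddings."""
    X = paraphrase_embs / np.linalg.norm(paraphrase_embs, axis=1, keepdims=True)
    # Top right singular vectors approximate the local text manifold's directions.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:dim].T  # columns form the subspace basis


def deviation_score(image_emb: np.ndarray, basis: np.ndarray) -> float:
    """Distance between the normalized image embedding and its subspace projection."""
    v = image_emb / np.linalg.norm(image_emb)
    proj = basis @ (basis.T @ v)
    return float(np.linalg.norm(v - proj))


rng = np.random.default_rng(0)
d = 512                                   # assumed CLIP embedding dimension
paraphrases = rng.normal(size=(20, d))    # stand-in for encoded caption variants
basis = build_text_subspace(paraphrases, dim=5)

test_image = rng.normal(size=d)           # stand-in for an encoded test image
score = deviation_score(test_image, basis)
threshold = 0.9                           # hypothetical calibration value
print("poisoned" if score > threshold else "benign", f"(score={score:.3f})")
```

In this sketch, benign images would be expected to yield small residuals (deep matching with the caption's semantic neighborhood), while poisoned images would yield large residuals (shallow matching with only the target caption); the threshold would need calibration on held-out benign data.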
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18477