Abstract: Recently, large pre-trained vision-language models such as CLIP have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment between high-level language features and fine-grained vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which addresses this misalignment through bidirectional adaptation of a Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). As a result, our approach requires only a simple binary prompt to efficiently accomplish anomaly classification and segmentation in zero-shot scenarios. Furthermore, we introduce its few-shot extension, SimCLIP+, which integrates relational information among vision embeddings and merges cross-modal synergy information between vision and language to address anomaly detection tasks. Extensive experiments on two challenging datasets demonstrate the superior generalization capacity of our method compared to current state-of-the-art approaches.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In this paper, we propose a novel approach, named SimCLIP, designed to bridge the gap between pre-trained CLIP and downstream anomaly detection tasks in zero-shot scenarios. SimCLIP realigns vision and language features through bidirectional adaptation of a Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT).
Supplementary Material: zip
Submission Number: 3643