Abstract: Visual anomaly detection and localization are crucial for industrial quality inspection. Traditional approaches rely on training task-specific models, which require large numbers of labeled normal and anomalous samples. In this paper, we depart from this paradigm and focus on few-shot anomaly detection and localization. To this end, we propose CLIP-DSA (a CLIP-based Discriminative and Self-Supervised Anomaly detection framework), which integrates: (1) a structured approach for extracting and aggregating window-, patch-, and image-level features, ensuring alignment with textual representations; (2) a Feature Adapter that transfers the aligned features toward the target domain; (3) a lightweight Anomaly Feature Generator that synthesizes anomaly features by perturbing normal features with Gaussian noise; (4) a binary Anomaly Discriminator that differentiates the synthesized anomaly features from normal ones; and (5) a Self-Supervised Learning module driven by pseudo-labels generated from CLIP's text embeddings. On MVTec AD, CLIP-DSA achieves 94.58%/88.65% AUROC for 4-normal-shot anomaly detection and localization, respectively.
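The pairing of components (3) and (4) can be sketched as below: pseudo-anomaly features are produced by adding Gaussian noise to normal features, then paired with binary labels for a discriminator. This is a minimal illustrative sketch, not the paper's implementation; the feature dimension, noise scale `sigma`, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_anomaly_features(normal_feats, sigma=0.1, rng=rng):
    """Hypothetical Anomaly Feature Generator (component 3): perturb
    normal features with zero-mean Gaussian noise. `sigma` is an
    assumed hyperparameter, not taken from the paper."""
    noise = rng.normal(0.0, sigma, size=normal_feats.shape)
    return normal_feats + noise

# Toy training set for the binary Anomaly Discriminator (component 4):
# label real normal features 0 and synthesized anomaly features 1.
normal = rng.normal(size=(8, 512))   # e.g. 512-d CLIP image features (assumed dim)
anomalous = synthesize_anomaly_features(normal)
X = np.vstack([normal, anomalous])   # (16, 512) discriminator inputs
y = np.concatenate([np.zeros(len(normal)), np.ones(len(anomalous))])
```

Any binary classifier (e.g. a small MLP head) could then be fit on `(X, y)`; the paper's actual discriminator architecture is not specified in the abstract.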
External IDs: dblp:conf/icic/ZengCLWT25