Abstract: Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in a target dataset without any training samples from it, leveraging models trained on auxiliary data. While CLIP offers strong cross-modal representations for ZSAD, its pretraining objective inherently emphasizes global foreground semantics over fine-grained local defects. Consequently, its anomaly localization remains highly sensitive to prompt wording, severely limiting existing methods that rely on explicit category labels. To overcome this limitation, we introduce ViP$^{2}$-CLIP, a lightweight CLIP-based ZSAD framework featuring Visual-Perception Prompting (ViP-Prompt) and Unified Text-Patch Alignment (UTPA). ViP-Prompt replaces fixed class-name tokens with image-conditioned cues, adaptively generating fine-grained prompts and obviating the need for manual templates and class-name priors. UTPA then aligns text prompts with patch features across multiple visual scales, jointly optimizing image-level detection and pixel-level localization. Together, these mechanisms enable the model to localize abnormal regions precisely and make it particularly robust when category labels are ambiguous or privacy-constrained. Extensive experiments on 14 industrial and medical benchmarks show that ViP$^{2}$-CLIP outperforms existing state-of-the-art approaches. Code is available at: https://anonymous.4open.science/r/Anonymous-11FF/.
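To make the two mechanisms concrete, below is a minimal PyTorch sketch of the ideas as described in the abstract: image-conditioned prompt generation and text-patch alignment averaged over multiple visual scales. All names (`ViPPrompt`, `utpa_scores`), shapes, and design details here are illustrative assumptions, not the paper's actual implementation; consult the linked code for the real method.

```python
# Hypothetical sketch of ViP-Prompt and UTPA; shapes and module names are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViPPrompt(nn.Module):
    """Image-conditioned prompting: fixed class-name tokens are replaced by
    learnable context tokens shifted by a cue projected from the global
    image feature (assumed design)."""
    def __init__(self, vis_dim=768, embed_dim=512, n_ctx=4):
        super().__init__()
        # learnable context tokens shared across images
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # projects the global visual feature into prompt-token space
        self.meta_net = nn.Sequential(
            nn.Linear(vis_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, img_feat):           # img_feat: (B, vis_dim)
        bias = self.meta_net(img_feat)     # (B, embed_dim)
        # broadcast the visual cue onto every context token
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)  # (B, n_ctx, embed_dim)

def utpa_scores(text_feat, patch_feats):
    """Text-patch alignment across scales: cosine similarity between
    normal/abnormal text embeddings and patch tokens from several encoder
    layers, averaged into one anomaly map (sketch, not the paper's loss)."""
    t = F.normalize(text_feat, dim=-1)     # (B, 2, D): [normal, abnormal]
    maps = []
    for p in patch_feats:                  # each p: (B, N, D), one per scale
        p = F.normalize(p, dim=-1)
        sim = p @ t.transpose(1, 2)        # (B, N, 2)
        maps.append(sim.softmax(dim=-1)[..., 1])  # abnormal prob. per patch
    pixel_map = torch.stack(maps).mean(0)  # (B, N), averaged over scales
    image_score = pixel_map.amax(dim=-1)   # image-level score via max pooling
    return pixel_map, image_score

# toy usage with random tensors standing in for CLIP features
B, N = 2, 196
prompts = ViPPrompt()(torch.randn(B, 768))           # image-conditioned prompts
text = torch.randn(B, 2, 512)                        # normal/abnormal embeddings
pmap, score = utpa_scores(text, [torch.randn(B, N, 512) for _ in range(3)])
```

The sketch assumes pixel-level localization comes from the per-patch abnormal probabilities and image-level detection from pooling over them, mirroring the joint optimization the abstract describes.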
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~bo_han2
Submission Number: 8331