Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection

TMLR Paper7452 Authors

Published: 10 Feb 2026 (modified: 23 Feb 2026) · Under review for TMLR · License: CC BY 4.0
Abstract: Zero-shot anomaly detection/localization trains on a source domain and must discriminate images from unseen target domains given only textual prompts (e.g., "normal" vs. "anomaly"); performance therefore hinges on generalization. Recent methods build on CLIP for its strong zero-shot generalization; however, as we show, localization has lagged behind detection and, especially for small regions, remains close to chance level (AUPRO near random), indicating weak pixel-level generalization. We attribute this to CLIP's limited ability to retain fine-grained features in its vision encoder and to insufficient alignment between the text encoder and dense visual features, neither of which previous methods address effectively. To address these challenges, we first replace CLIP's vision encoder with an adapted encoder that uses a correlation-based attention module to better preserve fine-grained features and small details. Second, we strengthen text–vision alignment by conditioning the learnable prompts in the text encoder on image context extracted from the vision encoder and by fusing local and global representations, further improving localization. Finally, we show that our correlation-based attention module can incorporate feature correlations from additional models such as DINOv2, further enhancing spatial understanding and localization. We call our model Crane (Context-Guided Prompt Learning and Attention Refinement) and its DINOv2-boosted variant Crane+, and show that it improves the state of the art by up to 28% in pixel-level localization (AUPRO) and up to 4.5% in image-level detection (AP) across 14 industrial and medical datasets.
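The abstract does not give the exact formulation of the correlation-based attention module, but the idea of replacing CLIP's query–key attention with an attention map derived from feature self-correlation can be sketched as follows. This is a minimal NumPy illustration of one common correlation-based variant (value–value attention), not the paper's actual implementation; all function names and the toy dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qk_attention(q, k, v):
    # Standard attention as in CLIP's vision encoder: patches attend
    # via query-key similarity, which can smear away small, local details.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

def correlation_attention(v):
    # Hypothetical correlation-based refinement: attention weights come
    # from the self-correlation of the value features, so each patch
    # mainly aggregates patches with similar content, which helps keep
    # fine-grained (e.g., small anomalous) regions distinct.
    scale = 1.0 / np.sqrt(v.shape[-1])
    return softmax(v @ v.T * scale) @ v

# Toy patch tokens: 4 patches with 8-dim features.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))

baseline = qk_attention(q, k, v)
refined = correlation_attention(v)
print(baseline.shape, refined.shape)  # (4, 8) (4, 8)
```

Under this reading, boosting the module with DINOv2 (the Crane+ variant) would amount to computing the correlation map from DINOv2 patch features while still aggregating CLIP's value features.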
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hankook_Lee1
Submission Number: 7452