FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Published: 20 Jul 2024, Last Modified: 01 Aug 2024, MM 2024 Poster, CC BY 4.0
Abstract: Zero-shot anomaly detection (ZSAD) methods detect anomalies directly, without access to any known normal or abnormal samples from the target item categories. Existing approaches typically rely on the strong generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, generic descriptions of "abnormal" often fail to precisely match the diverse types of anomalies across different object categories. Additionally, computing feature similarity for individual patches struggles to pinpoint anomalies of varying sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly recognition. HQ-Loc combines preliminary localization via Grounding DINO, position-enhanced text prompts, and a Multi-scale, Multi-shape Cross-modal Interaction (MMCI) module to localize anomalies of different sizes and shapes more accurately. Experimental results on the MVTec AD and VisA datasets demonstrate that FiLo significantly improves ZSAD performance in both recognition and localization, achieving state-of-the-art results with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on VisA.
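The abstract contrasts FiLo with the generic CLIP-prompt paradigm for ZSAD. The minimal sketch below illustrates that baseline paradigm only, not FiLo's FG-Des or HQ-Loc modules: image features are compared against text features for "normal"/"abnormal" prompts, and the softmax weight on the abnormal prompt serves as the anomaly score. The model choice (ViT-B/32), the prompt wording, and the file name sample.png are illustrative assumptions.

```python
# Sketch of the generic CLIP-based ZSAD scoring described in the abstract.
# Not FiLo itself; FiLo replaces the hand-crafted prompts below with
# LLM-generated, fine-grained, per-category anomaly descriptions (FG-Des).
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hand-crafted generic prompts representing "normal" vs. "abnormal" semantics.
prompts = ["a photo of a normal object", "a photo of a damaged object"]
text = clip.tokenize(prompts).to(device)

# "sample.png" is a placeholder test image (assumption).
image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)

# Normalize so the dot product is cosine similarity.
image_feat /= image_feat.norm(dim=-1, keepdim=True)
text_feat /= text_feat.norm(dim=-1, keepdim=True)

# Softmax over the two prompts; the weight on the "abnormal" prompt is the
# image-level anomaly score. Pixel-level localization would score per-patch
# features the same way instead of the global image feature.
probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
print(f"anomaly score: {probs[0, 1].item():.3f}")
```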
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: The method proposed in this paper, FiLo, presents a pioneering approach to zero-shot anomaly detection that draws on visual and textual modalities concurrently. We harness the multimodal capabilities and generalization power of the vision-language pre-trained model CLIP to design two modules, Fine-Grained Description (FG-Des) and High-Quality Localization (HQ-Loc), for anomaly recognition and localization across both modalities. FG-Des introduces detailed anomaly descriptions for each category using a Large Language Model (LLM) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly recognition. HQ-Loc combines preliminary localization via Grounding DINO, position-enhanced text prompts, and a Multi-scale, Multi-shape Cross-modal Interaction (MMCI) module to localize anomalies of different sizes and shapes more accurately. We thoroughly explore the application of CLIP to downstream industrial anomaly detection tasks and demonstrate that fine-grained textual descriptions and deeper cross-modal interaction enhance the performance of multimodal models.
Supplementary Material: zip
Submission Number: 1502