Keywords: logical anomaly detection, multi-modal transfer learning
TL;DR: We establish the first few-shot logical anomaly detection benchmark and present a simple yet effective training-free CLIP-based method.
Abstract: Anomaly detection (AD) is crucial for visual inspection and covers two main types of anomalies: structural and logical. Despite growing interest in AD, most methods focus on structural anomalies, while few works address logical anomaly detection (LAD), which requires a global understanding of the context. Leading LAD methods often rely on segmentation algorithms to parse logical relations within images, which requires extensive training images or elaborate labels and suffers significant performance degradation in low-data scenarios. This study explores a practical yet challenging scenario where only a few normal images are available. To this end, we introduce CLIP-LAD, a novel, training-free method for few-shot LAD. We propose a coarse-to-fine segmentation process, involving foreground extraction and fine-grained alignment, to progressively harness CLIP's generalization ability for LAD. Specifically, we first aggregate visual features into different regions with clear boundaries, benefiting from the strong visual coherence of vision transformers (ViT), and leverage coarse prompts to help identify the foreground. Within the foreground, we further conduct per-pixel fine-grained classification with fine prompts to parse the different parts of an object. Anomaly scores are derived from class histograms computed over the resulting segmentation masks. For comprehensive evaluation, we build a few-shot LAD benchmark based on the MVTec LOCO dataset and include a series of comparison methods. Experiments on this benchmark demonstrate the superiority of our method in the low-data regime.
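The abstract does not spell out how class histograms from the segmentation masks are turned into an anomaly score; below is a minimal illustrative sketch under assumed details (integer-valued part-class masks, normalized histograms, and L1 distance to the nearest few-shot normal reference). The function names and the distance choice are hypothetical, not the authors' exact formulation.

```python
import numpy as np

def class_histogram(seg_mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Fraction of pixels assigned to each part class in a segmentation mask."""
    hist = np.bincount(seg_mask.ravel(), minlength=num_classes).astype(np.float64)
    return hist / hist.sum()  # normalize so images of different sizes are comparable

def anomaly_score(test_mask: np.ndarray,
                  normal_masks: list[np.ndarray],
                  num_classes: int) -> float:
    """Score a test image by how far its class histogram deviates from the
    closest histogram among the few-shot normal references (assumed scoring rule)."""
    test_hist = class_histogram(test_mask, num_classes)
    ref_hists = [class_histogram(m, num_classes) for m in normal_masks]
    # L1 distance to the nearest normal histogram; a large gap would indicate a
    # missing or surplus part, i.e. a logical anomaly.
    return min(float(np.abs(test_hist - r).sum()) for r in ref_hists)
```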
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4057