Abstract: State-of-the-art methods for industrial anomaly detection (IAD) typically rely on a training set of images to define normal conditions, flagging any deviations as anomalies. Obtaining this training set has two main issues: it is time-consuming to collect an extensive labeled set, and the assumption that all patterns outside the training set are truly anomalous is often unrealistic. Many rare patterns not captured in the training set, such as environmental changes, positional changes, or permissible deformation, may not constitute actual industrial defects. In this paper, we reframe the IAD task by using large vision-language models (LVLMs) without fine-tuning on training images. LVLMs can interpret and generalize from a single reference image, and can be more robust to rare but acceptable changes in images. Our experiments on two popular benchmarks, MVTec-AD and VisA, show that LVLMs given just one image and a textual description are competitive with state-of-the-art models, and offer a more robust and generalizable solution even under variations in test images. We also identify a key limitation: LVLM performance degrades when detecting small anomalies. Despite this, our findings highlight the potential of LVLMs as a flexible and scalable foundation for industrial anomaly detection, opening new directions for future research.
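To make the setup concrete, here is a minimal sketch of the kind of one-shot prompting the abstract describes: a single normal reference image plus a textual description, sent to a hosted LVLM alongside the query image. The model name, prompt wording, and the use of the OpenAI client are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (assumed setup, not the paper's exact method): one-shot IAD
# by prompting an LVLM with a normal reference image, a textual description,
# and a query image. Model choice and prompt wording are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def detect_anomaly(reference_path: str, query_path: str, description: str) -> str:
    """Ask the LVLM whether the query image deviates from the reference."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any LVLM that accepts image inputs would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    f"The first image shows a defect-free {description}. "
                    "Does the second image contain an industrial defect? "
                    "Ignore acceptable variations such as lighting, position, "
                    "or mild deformation. Answer 'normal' or 'anomalous'."},
                {"type": "image_url",
                 "image_url": {"url": to_data_url(reference_path)}},
                {"type": "image_url",
                 "image_url": {"url": to_data_url(query_path)}},
            ],
        }],
    )
    return response.choices[0].message.content


# Example usage (hypothetical file names):
# print(detect_anomaly("good_screw.png", "test_screw.png", "metal screw"))
```

Because no fine-tuning is involved, swapping the benchmark, product category, or underlying LVLM only requires changing the reference image and the description string.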
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=plzDIv2dfX
Changes Since Last Submission: The last submission was rejected due to the formatting of the upper margin; this has been fixed to conform to the TMLR format. We also obtained permission from a company to share images from real applications. These images and the accompanying discussion have been added to Section 3.2 since the previous submission.
Assigned Action Editor: ~Ofir_Lindenbaum1
Submission Number: 5496