CLIP-MDC: CLIP encoder based multimodal defect classification with synthetic anomaly generation for real-time surface defect detection

Taewon Ha, Chaeseon Hwang, Jongpil Jeong

Published: 17 Jan 2026 · Last Modified: 26 Feb 2026 · Journal of Intelligent Manufacturing · CC BY-SA 4.0
Abstract: We introduce contrastive language–image pre-training-based multimodal defect classification (CLIP-MDC), a framework for multimodal defect detection and classification in smart manufacturing. Using text prompts that pair object names with defect types, CLIP-MDC establishes a semantic space linking images and text, enabling explainable defect predictions in natural language. The model integrates a lightweight backbone network with contrastive language–image pre-training (CLIP) encoders to perform both pixel-level anomaly segmentation and image-level defect classification in supervised and weakly supervised settings. A Perlin noise-based synthetic anomaly generation technique supports learning when labeled data are scarce, and a dual prediction architecture infers defect location and type simultaneously. On the MVTec AD and KSDD2 datasets, the model achieved an area under the receiver operating characteristic curve (AUROC) of 99.9%, an area under the per-region overlap curve (AUPRO) of 98.6%, a pixel-level AUROC (P-AUROC) of 99.9%, and an average precision for localization (\(AP_{loc}\)) of 87.6%, with real-time capability at an average inference time of 6.6 ms on an A100 GPU. By combining visual and linguistic information in a semantics-based multimodal learning framework, CLIP-MDC delivers accuracy, explainability, generalization, and real-time efficiency in defect detection, making it a practical and scalable solution for industrial defect analysis in real-world manufacturing environments.
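As a rough illustration of the prompt-based semantic matching described in the abstract, the sketch below scores a single inspection image against text prompts that pair an object name with candidate defect types, using an off-the-shelf CLIP model via Hugging Face Transformers. The checkpoint, object name, defect list, prompt template, and file name are illustrative assumptions, not the paper's configuration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP checkpoint (assumed; the paper may use a different backbone).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompts combining an object with candidate defect types (illustrative only).
object_name = "metal surface"
defect_types = ["no defect", "scratch", "crack", "contamination"]
prompts = [f"a photo of a {object_name} with {d}" for d in defect_types]

image = Image.open("sample.png").convert("RGB")  # hypothetical inspection image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits, softmaxed into per-defect probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for defect, p in zip(defect_types, probs.tolist()):
    print(f"{defect}: {p:.3f}")

Similarly, the Perlin noise-based synthetic anomaly generation can be sketched as follows: sample 2D Perlin noise, threshold it into a pseudo-defect mask, and blend an external texture into a normal image under that mask, yielding a pseudo-defective image with a pixel-level label. The grid resolution, threshold, and blending factor below are assumptions; the paper's exact generation procedure may differ.

import numpy as np

def rand_perlin_2d(shape, res):
    """2D Perlin noise of size (H, W) with (res_y, res_x) gradient cells.

    Assumes H and W are divisible by res_y and res_x, respectively.
    """
    d = (shape[0] // res[0], shape[1] // res[1])
    # Random unit gradient vectors at the grid corners.
    angles = 2 * np.pi * np.random.rand(res[0] + 1, res[1] + 1)
    gradients = np.stack((np.cos(angles), np.sin(angles)), axis=-1)
    # Fractional position of each pixel inside its grid cell.
    ys = np.linspace(0, res[0], shape[0], endpoint=False) % 1
    xs = np.linspace(0, res[1], shape[1], endpoint=False) % 1
    fy, fx = np.meshgrid(ys, xs, indexing="ij")

    def dot_corner(oy, ox):
        # Dot product between corner gradients and pixel offset vectors.
        cy = np.arange(shape[0]) // d[0] + oy
        cx = np.arange(shape[1]) // d[1] + ox
        g = gradients[cy[:, None], cx[None, :]]
        return g[..., 0] * (fx - ox) + g[..., 1] * (fy - oy)

    def fade(t):
        return 6 * t**5 - 15 * t**4 + 10 * t**3

    n00, n01 = dot_corner(0, 0), dot_corner(0, 1)
    n10, n11 = dot_corner(1, 0), dot_corner(1, 1)
    u, v = fade(fy), fade(fx)
    nx0 = n00 * (1 - v) + n01 * v  # interpolate along x at the lower y corners
    nx1 = n10 * (1 - v) + n11 * v  # interpolate along x at the upper y corners
    return np.sqrt(2) * (nx0 * (1 - u) + nx1 * u)  # interpolate along y

def synthesize_anomaly(image, texture, threshold=0.5, beta=0.7):
    """Blend a texture into `image` where thresholded Perlin noise is high.

    `image` and `texture` are float arrays of shape (H, W, 3) in [0, 1].
    Returns the augmented image and its binary pixel-level anomaly mask.
    """
    h, w = image.shape[:2]
    noise = rand_perlin_2d((h, w), (8, 8))
    mask = (noise > threshold).astype(np.float32)[..., None]
    augmented = image * (1 - mask) + (beta * texture + (1 - beta) * image) * mask
    return augmented, mask[..., 0]

Both sketches are minimal stand-ins for the components named in the abstract; they are not the authors' released code.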