CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

TMLR Paper5572 Authors

07 Aug 2025 (modified: 13 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defects), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Because the CLIP-based discriminative model has limited capacity to capture fine-grained local details, we incorporate a diffusion-based generative model to complement its features, yielding a synergistic solution for anomaly detection. Specifically, we propose using diffusion models as feature extractors for anomaly detection and introduce carefully designed strategies to extract informative cross-attention and feature maps. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification, under zero-shot and few-shot settings alike. We believe our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.
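The abstract's core idea, pairing a global CLIP image feature with multi-scale feature maps taken from a pretrained diffusion UNet, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the Stable Diffusion v1.5 and CLIP ViT-B/16 checkpoints, the forward hooks on the UNet decoder blocks, the zero-tensor text conditioning, and the fixed noise timestep are all illustrative assumptions, and the paper's cross-attention extraction and fusion strategies are not reproduced here.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
sd_id = "runwayml/stable-diffusion-v1-5"   # assumed generative backbone
clip_id = "openai/clip-vit-base-patch16"   # assumed discriminative backbone

# Discriminative branch: CLIP image encoder (global, semantic features).
clip = CLIPModel.from_pretrained(clip_id).to(device).eval()
clip_proc = CLIPProcessor.from_pretrained(clip_id)

# Generative branch: Stable Diffusion VAE + denoising UNet.
vae = AutoencoderKL.from_pretrained(sd_id, subfolder="vae").to(device).eval()
unet = UNet2DConditionModel.from_pretrained(sd_id, subfolder="unet").to(device).eval()
scheduler = DDPMScheduler.from_pretrained(sd_id, subfolder="scheduler")

# Forward hooks on the UNet decoder blocks collect multi-scale feature maps.
feature_maps = []
for block in unet.up_blocks:
    block.register_forward_hook(lambda _m, _inp, out: feature_maps.append(out))

# Preprocessing for the VAE: 512x512 image scaled to [-1, 1].
to_sd = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

@torch.no_grad()
def extract_features(image_pil, timestep=50):
    """Return a global CLIP feature and multi-scale diffusion feature maps."""
    feature_maps.clear()

    # CLIP global image feature.
    clip_in = clip_proc(images=image_pil, return_tensors="pt").to(device)
    clip_feat = clip.get_image_features(**clip_in)                    # (1, d)

    # Diffusion features: encode to latents, add noise, run one UNet pass.
    x = to_sd(image_pil).unsqueeze(0).to(device)
    latents = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.tensor([timestep], device=device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Placeholder conditioning; an empty-prompt text embedding would be used in practice.
    cond = torch.zeros(1, 77, unet.config.cross_attention_dim, device=device)
    unet(noisy, t, encoder_hidden_states=cond)

    # Upsample decoder feature maps to a common resolution for later fusion/scoring.
    maps = [F.interpolate(f.float(), size=(64, 64), mode="bilinear", align_corners=False)
            for f in feature_maps]
    return clip_feat, maps
```

In such a pipeline, the per-location diffusion features and the global CLIP feature would then feed a fusion and anomaly-scoring stage (e.g., comparison against zero-shot prompts or few-shot normal references); that stage is the paper's contribution and is not sketched here.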
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Chuan-Sheng_Foo1
Submission Number: 5572