EVT-CLIP: Enhancing Zero-Shot Anomaly Segmentation with Vision-Text Models

Published: 2025 · Last Modified: 08 Jan 2026 · CASE 2025 · CC BY-SA 4.0
Abstract: In recent years, zero-shot anomaly segmentation (ZSAS) has emerged as a cutting-edge technology with significant potential for anomaly detection. However, traditional methods often rely on manually designed, fixed textual descriptions or anomaly prompts, which limits a model's adaptability to different types of anomalies. Existing methods also fall short in the interaction and fusion of image and text features, resulting in weak cross-modal understanding and insufficient information sharing. To address these challenges, this paper proposes EVT-CLIP, a novel ZSAS method designed to improve anomaly detection and localization. Its core idea is to combine a Dynamic Attention-Enhanced Prompt (DAEP) module with a Cross-modal Interaction (CMI) module to strengthen the model's generalization capability and cross-modal information fusion. Specifically, the DAEP module reduces reliance on category-specific information by fusing global image features with textual prompts, enhancing adaptability to various anomaly types. The CMI module captures both local details and global context in images through deep interaction between image and text features, refining the text embeddings and substantially improving cross-modal understanding. Experiments on multiple benchmark datasets show that EVT-CLIP outperforms existing ZSAS methods in anomaly segmentation, demonstrating its effectiveness in practical applications.
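To make the abstract's description concrete, below is a minimal PyTorch sketch of what a DAEP-style prompt fusion and a CMI-style cross-modal interaction could look like on top of CLIP features. The paper's code is not reproduced here; every class name, dimension, and wiring choice in this sketch is an assumption inferred from the abstract, not the authors' actual implementation.

```python
# Illustrative sketch only: module names, dimensions, and wiring are
# assumptions based on the abstract, not the published EVT-CLIP code.
import torch
import torch.nn as nn

class DAEP(nn.Module):
    """Dynamic Attention-Enhanced Prompt (assumed design): learnable
    prompt tokens attend to a global image feature, so the textual
    prompt adapts to each image rather than staying fixed."""
    def __init__(self, dim=512, n_ctx=8):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img_global):                       # img_global: (B, dim)
        b = img_global.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)    # (B, n_ctx, dim)
        kv = img_global.unsqueeze(1)                     # (B, 1, dim)
        fused, _ = self.attn(ctx, kv, kv)                # prompts attend to the image
        return ctx + fused                               # image-conditioned prompt tokens

class CMI(nn.Module):
    """Cross-modal Interaction (assumed design): text embeddings attend
    over patch tokens to pick up local detail and global context."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, patch_tokens):           # (B, T, dim), (B, P, dim)
        out, _ = self.attn(text_emb, patch_tokens, patch_tokens)
        return self.norm(text_emb + out)                 # refined text embeddings

# Toy usage: score patches by cosine similarity against refined text tokens.
B, P, D = 2, 196, 512
img_global, patches = torch.randn(B, D), torch.randn(B, P, D)
prompts = DAEP(D)(img_global)                            # (B, n_ctx, D)
text = CMI(D)(prompts, patches)                          # (B, n_ctx, D)
score = torch.einsum(
    "bpd,btd->bpt",
    nn.functional.normalize(patches, dim=-1),
    nn.functional.normalize(text, dim=-1),
).max(-1).values                                         # per-patch anomaly score (B, P)
```

In this reading, DAEP replaces hand-crafted fixed prompts with image-conditioned ones, and CMI lets the text side see patch-level evidence before the similarity map is computed; the real modules may differ in depth, heads, and how the anomaly map is assembled.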