Learning Multi-granularity Visual-textual Alignment for Zero-shot Anomaly Detection

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: zero-shot anomaly detection, industrial anomaly detection
TL;DR: This paper proposes multi-perception prompt learning with MoE adaption for zero-shot anomaly detection.
Abstract: Using a detection model trained on an auxiliary dataset to detect anomalies has shown strong potential for zero-shot anomaly detection (ZSAD). However, prior approaches typically rely on text prompts with single-level visual-textual representations, which hinders the detection of anomalies that vary in shape and appearance. To address this limitation, we propose a generalized ZSAD framework that extends visual-textual alignment from coarse-grained to multi-granularity. On the textual side, we expand the traditional single-level alignment into a multi-level paradigm. Unlike previous work that ensembles multiple prompts with limited perception, we assign prompts with multiple receptive fields to facilitate the learning of structured visual semantics at different levels of granularity. This constitutes our MPA approach. Building upon MPA, we further enhance visual granularity by employing different experts for fine-grained modeling of visual patch tokens. To this end, we propose a Mixture-of-Experts adaptation mechanism that dynamically routes patch tokens to multiple experts from a shared expert pool. This allows the selected experts, each with specialized knowledge, to collaboratively represent visual tokens at multiple granularities. These components constitute our MPAMA framework. We evaluate both MPA and MPAMA on datasets across industrial and medical domains, and extensive experiments demonstrate that our method outperforms state-of-the-art approaches.
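To make the Mixture-of-Experts adaptation described above concrete, the sketch below shows a generic top-k MoE adapter over visual patch tokens: a gating network scores each token, the top-k experts are selected per token, and their outputs are combined with softmax-normalized weights. All class and parameter names (`PatchTokenMoE`, `dim`, `num_experts`, `top_k`) are hypothetical illustrations of the routing idea, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PatchTokenMoE(nn.Module):
    """Illustrative top-k MoE adapter for patch tokens (sketch, not the paper's code)."""

    def __init__(self, dim: int = 64, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network: produces a routing score per expert for every token.
        self.gate = nn.Linear(dim, num_experts)
        # Shared pool of experts, each a small feed-forward adapter.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        scores = self.gate(tokens)                          # (B, P, E)
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick top-k experts per token
        weights = weights.softmax(dim=-1)                   # normalize over selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e                                 # (B, P, k): where expert e was chosen
            if mask.any():
                # Combined gate weight this expert receives at each token position.
                w = (weights * mask).sum(dim=-1, keepdim=True)
                out = out + w * expert(tokens)
        return out
```

For simplicity this sketch runs every expert densely and masks the result; production MoE layers typically dispatch only the selected tokens to each expert for efficiency.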
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5885