Abstract: Large vision-language models (VLMs) have made significant progress in zero-shot anomaly detection (ZSAD); however, the semantic gap between images and text limits their performance in hierarchical learning. In this paper, we propose the Hierarchical Alignment CLIP (HieClip) framework to achieve hierarchical alignment between images and text. Specifically, we introduce learnable hierarchical textual prompts (LHT) to reduce the representation differences between images and text at various levels, while performing comprehensive multi-level discrimination. Additionally, the framework dynamically adjusts the weights of features at different levels, improving the model's ability to capture both global and local information. Experiments on public industrial datasets demonstrate HieClip's effectiveness, showing significant accuracy improvements, and its strong generalization capability is further validated on medical datasets. Compared to existing methods, HieClip excels in anomaly detection tasks, particularly in industrial inspection and medical diagnosis scenarios.