Scalable Industrial Visual Anomaly Detection With Partial Semantics Aggregation Vision Transformer

Published: 01 Jan 2024, Last Modified: 05 Mar 2025 · IEEE Trans. Instrum. Meas. 2024 · CC BY-SA 4.0
Abstract: In recent years, industrial visual anomaly detection (VAD) has attracted significant attention in the context of advanced smart manufacturing systems. However, several limitations remain unresolved in existing approaches. While these methods can achieve satisfactory performance when training a separate model for each category, their scalability and performance suffer when faced with the challenge of training for multiple categories simultaneously. In addition, reconstruction-based methods generally suffer from the identical mapping problem. To address these limitations, this study introduces the partial semantics aggregation vision transformer (PSA-VT), a scalable framework for industrial VAD that enables simultaneous multicategory anomaly detection with a single model. The proposed PSA-VT framework adopts a hybrid design strategy. First, a pretrained convolutional neural network (CNN) is employed to extract multiscale discriminative local representations. Subsequently, the PSA-VT performs representation reconstruction through long-range global semantic aggregation. Finally, anomalous properties are estimated by evaluating the reconstruction error of the representations. We conducted extensive experiments on the MVTec AD industrial anomaly detection dataset, as well as on semantic anomaly detection datasets. The experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance by capturing high-level semantics. Notably, PSA-VT surpasses other methods on the one-model-15-category anomaly detection task on the MVTec AD dataset. Furthermore, we applied incremental learning techniques to enable rapid deployment of PSA-VT in a real industrial scenario.
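The scoring principle described in the abstract — reconstruct CNN features and treat the reconstruction error as the anomaly signal — can be sketched minimally. This is not the authors' implementation: the feature tensor, the linear stand-in for the PSA-VT reconstructor, and all shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multiscale CNN features for one image, flattened to a
# (H, W, C) patch grid. In the paper these come from a pretrained CNN.
features = rng.normal(size=(14, 14, 64))

# Stand-in for the learned reconstructor: a fixed linear map. In PSA-VT
# this role is played by a transformer performing long-range global
# semantic aggregation, trained to reproduce normal-sample features.
W = rng.normal(scale=0.1, size=(64, 64))
reconstructed = features @ W

# Pixel-level anomaly map: per-location L2 reconstruction error.
# Regions the reconstructor cannot reproduce score high.
anomaly_map = np.linalg.norm(features - reconstructed, axis=-1)

# Image-level anomaly score: maximum error over spatial locations.
score = float(anomaly_map.max())
```

In practice the anomaly map would be upsampled to the input resolution for localization, and the reconstructor's training on normal data only is what makes large errors indicative of anomalies; the identical-mapping problem the abstract mentions arises when such a reconstructor learns to copy its input, collapsing the error for anomalies as well.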