Abstract: This article investigates the potential of dual CNN-Transformer architectures for generalizable few-shot anomaly detection (GFSAD), a practical yet understudied form of anomaly detection (AD). In GFSAD, a single model must be learned and shared across multiple categories while remaining adaptable to new categories given only a limited number of normal images. Although CNN-Transformer architectures have achieved strong results in many vision tasks, their potential in GFSAD remains unexplored. In this article, we introduce ADFormer, a dual CNN-Transformer architecture that combines the strengths of CNNs and Transformers to learn discriminative features with both local and global receptive fields. We further incorporate a self-supervised bipartite matching approach into ADFormer that reconstructs query images from support images and detects anomalies based on high reconstruction loss. Additionally, we present a consistency-enhanced loss that improves the spatial and semantic consistency of features, thereby reducing the dependence on a large AD dataset for training. Experimental results show that ADFormer with the consistency-enhanced loss significantly improves GFSAD performance, considerably outperforming other AD methods on the MVTec AD, MPDD, and VisA datasets.
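To make the high-level pipeline concrete, the following is a minimal, hypothetical sketch of the two ideas the abstract names: a dual CNN-Transformer feature extractor (local CNN branch plus global Transformer branch) and reconstruction-based anomaly scoring, where query features are reconstructed from support features and a large reconstruction error flags an anomaly. All module names, layer sizes, and the soft-matching reconstruction (used here as a simple stand-in for bipartite matching) are illustrative assumptions, not the authors' ADFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualCNNTransformer(nn.Module):
    """Extracts features with a local (CNN) branch and a global (Transformer) branch."""

    def __init__(self, dim=64, n_heads=4, n_layers=2):
        super().__init__()
        # CNN branch: local receptive field, downsamples the image into a feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer branch: global receptive field over the CNN feature tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        f = self.cnn(x)                                # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)          # (B, H*W, C) patch tokens
        g = self.transformer(tokens)                   # global context per token
        return tokens + g                              # fuse local and global features


def anomaly_score(query_feats, support_feats, tau=0.1):
    """Reconstruct query tokens from support tokens via soft matching;
    the per-token reconstruction error serves as the anomaly score."""
    q = F.normalize(query_feats, dim=-1)               # (B, Nq, C)
    s = F.normalize(support_feats.flatten(0, 1), dim=-1)  # (K*Ns, C) support pool
    attn = torch.softmax(q @ s.T / tau, dim=-1)        # soft assignment to support tokens
    recon = attn @ s                                   # reconstructed query tokens
    return (q - recon).pow(2).sum(-1)                  # (B, Nq) error map


if __name__ == "__main__":
    net = DualCNNTransformer()
    support = torch.randn(4, 3, 64, 64)                # a few normal support images
    query = torch.randn(1, 3, 64, 64)
    with torch.no_grad():
        s_feats = net(support)                         # (4, 256, 64)
        q_feats = net(query)                           # (1, 256, 64)
        scores = anomaly_score(q_feats, s_feats)
    print(scores.shape, scores.max().item())           # high max error => likely anomaly
```

In this sketch, normal query regions should be well reconstructed from the support pool (low error), while anomalous regions lack matching support tokens and yield high error, which is the detection principle the abstract describes.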