Sliced Vision Transformers for Fine-Grained Anomaly Detection and Localization

Published: 2025, Last Modified: 28 Jan 2026. Venue: AVSS 2025. License: CC BY-SA 4.0
Abstract: Industrial inspection depends on accurately detecting and localizing small, narrow defects. Vision Transformer (ViT)-based networks for image anomaly detection and localization have improved markedly in recent years, but their conventional square patch embedding may be suboptimal for fine-grained anomaly detection, where anomalies are typically elongated or irregular in shape. To address this problem, we introduce a novel approach that replaces the square patches of the ViT with slice-shaped patches, which improves spatial resolution and yields more detailed feature representations. Our model achieves state-of-the-art results on two existing industrial anomaly detection benchmarks, indicating that it effectively captures the morphological details and spatial dependencies of intricate anomaly patterns.
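The abstract does not give the slice dimensions or the embedding details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical 256x256 image, a hypothetical 4x64 slice, and a helper named `patchify` introduced here for illustration. It compares that slice with a 16x16 square patch of the same area, so both tokenizations produce the same number of tokens with the same dimension.

```python
import numpy as np


def patchify(image, patch_h, patch_w):
    """Split an (H, W, C) image into non-overlapping patch_h x patch_w patches.

    Returns an array of shape (num_patches, patch_h * patch_w * C), i.e. one
    flattened token per patch, before any learned linear projection.
    """
    h, w, c = image.shape
    if h % patch_h or w % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch_h, w // patch_w
    # (gh, patch_h, gw, patch_w, C) -> (gh, gw, patch_h, patch_w, C)
    patches = image.reshape(gh, patch_h, gw, patch_w, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_h * patch_w * c)


# Hypothetical test image: a thin horizontal scratch, 2 px tall and 160 px wide.
image = np.zeros((256, 256, 3), dtype=np.float32)
image[100:102, 40:200] = 1.0

square = patchify(image, 16, 16)  # conventional square patches
sliced = patchify(image, 4, 64)   # assumed slice shape with the same area

# Both tokenizations give 256 tokens of dimension 768.
print(square.shape, sliced.shape)

# Number of tokens the scratch touches: 11 square patches vs 4 slices.
print((square.max(axis=1) > 0).sum(), (sliced.max(axis=1) > 0).sum())

# Largest fraction of a single token occupied by the scratch: 0.125 vs 0.5.
print(square.mean(axis=1).max(), sliced.mean(axis=1).max())
```

In this toy setting the elongated defect fills up to half of a slice-shaped token but at most one eighth of a square one, which is one plausible intuition for why patch shape might matter for narrow anomalies. The slice orientation and size here are assumptions and would need to match the defect geometry in practice.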