Abstract: Zero-Shot Anomaly Detection (ZSAD) is an emerging anomaly detection (AD)
paradigm. Unlike the traditional unsupervised AD setting,
which requires a large number of normal samples to train a
model, ZSAD is more practical for handling data-restricted
real-world scenarios. Recently, Multimodal Large Language
Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, reasoning about
image abnormalities remains underexplored due to the lack
of corresponding datasets and benchmarks. To facilitate
research in AD and reasoning, we establish the first visual
instruction tuning dataset, Anomaly-Instruct-125k, and the
evaluation benchmark, VisA-D&R. Through investigation
with our benchmark, we reveal that current MLLMs like
GPT-4o cannot accurately detect and describe fine-grained
anomalous details in images. To address this, we propose
Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human
behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively
select and emphasize abnormal visual tokens. Extensive
experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both
detection and reasoning. Extensions to medical and 3D AD
are provided for future study. Project page:
https://xujiacong.github.io/Anomaly-OV/