Keywords: Video Anomaly Detection, Vision–Language Models (VLMs), categorical view, Unsupervised Anomalous Period Detector (UAPD), Category-based Atom Miner (CAM)
Abstract: Video anomaly detection (VAD) seeks to identify events that deviate from learned normality. Current Vision–Language Models (VLMs) face significant challenges: anomalies are rare, labels are weak, and visual appearance varies drastically. Because mainstream VLMs map visual features directly to events, they overfit to incidental cues present during training and generalize poorly. To address this issue, we propose a categorical view of anomaly understanding that replaces the "visual features to event" mapping with a "visual features to learnable atoms to event" framework that models direct, indirect, and counter-evidence cues. First, an Unsupervised Anomalous Period Detector (UAPD) is proposed to identify abnormal periods. Next, a Category-based Atom Miner (CAM) is proposed to map visual features in video segments to learned atoms and to learn the role of each atom. At inference, CAM provides role-aware indications to the VLM, which maps meaningful atoms and visual features to event predictions. This framework harnesses meaningful evidence while preserving the generalization capacity of VLMs. Extensive experiments and ablations show consistent gains over strong vision-only and fine-tuned VLM baselines.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8757