Abstract: In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated
the ability to express visual data in semantically rich textual representations, making them
highly effective for downstream tasks. Given their cross-modal semantic representation
power, leveraging such models for video anomaly detection (VAD) holds significant promise.
In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised
video anomaly detection (wVAD) that effectively fuses visual and semantic features
obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP.
Our approach enhances performance and explainability in anomaly detection. Additionally,
we analyze the sensitivity of recent state-of-the-art models to randomness in training
initialization and introduce a more comprehensive evaluation framework to assess their robustness
to small changes in training, such as the seed of the random number generator. This framework
aims to provide a more rigorous and holistic assessment of model performance, ensuring a
deeper understanding of their reliability and reproducibility in wVAD.
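To make the seed-robustness evaluation described above concrete, the following is a minimal illustrative sketch rather than the paper's actual framework: it repeats a training run under several seeds and reports the spread of the resulting scores. The `train_and_evaluate` stub is hypothetical and stands in for a full wVAD training and evaluation run.

```python
import random
import statistics

import numpy as np
import torch


def train_and_evaluate(seed: int) -> float:
    """Hypothetical placeholder: train a wVAD model with the given seed
    and return its frame-level AUC on the test split."""
    # Fix the common random number generators before model initialization and training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # ... actual training and evaluation would go here ...
    return float(np.random.uniform(0.80, 0.86))  # stand-in score for this sketch


# Repeat the full run under several seeds and report the spread,
# not just the single best run.
seeds = [0, 1, 2, 3, 4]
aucs = [train_and_evaluate(s) for s in seeds]
print(
    f"AUC over {len(seeds)} seeds: "
    f"mean={statistics.mean(aucs):.4f}, std={statistics.stdev(aucs):.4f}, "
    f"min={min(aucs):.4f}, max={max(aucs):.4f}"
)
```

Reporting mean, standard deviation, and range across seeds, rather than a single score, is what allows the robustness of a method to small training changes to be compared fairly.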
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Derek_Hoiem1
Submission Number: 5697