SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection

TMLR Paper 5697 Authors

21 Aug 2025 (modified: 19 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated the ability to express visual data in semantically rich textual representations, making them highly effective for downstream tasks. Given their cross-modal semantic representation power, leveraging such models for video anomaly detection (VAD) holds significant promise. In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised video anomaly detection (wVAD) that effectively fuses visual and semantic features obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP. Our approach enhances both performance and explainability in anomaly detection. Additionally, we analyze the sensitivity of recent state-of-the-art models to randomness in training initialization and introduce a more comprehensive evaluation framework to assess their robustness to small changes in training, such as the seed of the random number generator. This framework aims to provide a more rigorous and holistic assessment of model performance, ensuring a deeper understanding of model reliability and reproducibility in wVAD.
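As a rough illustration of the kind of visual/semantic fusion the abstract describes, the sketch below embeds a frame with CLIP's image encoder, embeds a VideoLLaMA-style caption with CLIP's text encoder, and scores the concatenated features with a small head. The fusion head, the embedding dimension, and the `caption` placeholder are assumptions made for illustration; this is not the SemVAD architecture from the paper.

```python
# Hedged sketch of visual/semantic feature fusion for wVAD. The fusion
# head, dimensions, and the caption placeholder are illustrative
# assumptions, not the paper's actual method.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionScorer(nn.Module):
    """Concatenate a visual and a semantic embedding; emit an anomaly score."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([vis, sem], dim=-1))

# A random stand-in frame, plus the kind of caption a video LLM such as
# VideoLLaMA 3 might produce for it (placeholder text).
frame = Image.fromarray((np.random.rand(224, 224, 3) * 255).astype("uint8"))
caption = "a person walking through a parking lot at night"

with torch.no_grad():
    vis = clip.get_image_features(**processor(images=frame, return_tensors="pt"))
    sem = clip.get_text_features(
        **processor(text=[caption], return_tensors="pt", padding=True)
    )
    score = FusionScorer()(vis, sem)  # per-frame anomaly score in [0, 1]

print(f"anomaly score: {score.item():.3f}")
```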
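The seed-sensitivity evaluation the abstract mentions can be approximated by repeating training under several random seeds and reporting the mean and spread of frame-level AUC, as sketched below. Here `train_and_score` is a hypothetical stand-in for a full wVAD training run, not a function from the paper.

```python
# Hedged sketch of a seed-robustness evaluation loop. `train_and_score`
# is a hypothetical placeholder for training a wVAD model and returning
# per-frame anomaly scores on a test set.
import numpy as np
from sklearn.metrics import roc_auc_score

def train_and_score(seed: int, y_true: np.ndarray) -> np.ndarray:
    """Placeholder: pretend to train under `seed`; return random scores."""
    rng = np.random.default_rng(seed)
    return rng.random(y_true.shape[0])

y_true = np.random.randint(0, 2, size=1000)  # frame-level labels (stand-in)
aucs = [roc_auc_score(y_true, train_and_score(s, y_true)) for s in range(5)]
print(f"AUC over 5 seeds: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

Reporting mean and standard deviation across seeds, rather than a single run, is what makes the comparison between state-of-the-art wVAD models more rigorous.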
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Derek_Hoiem1
Submission Number: 5697