CLIP-Driven Multi-Scale Instance Learning for Weakly Supervised Video Anomaly Detection

Published: 01 Jan 2024 · Last Modified: 15 Jul 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: Existing weakly supervised video anomaly detection methods mainly employ Multiple Instance Learning (MIL) to identify abnormal snippets in untrimmed videos. However, the semantics and visual presentations of anomalies are often ambiguous in ways that MIL struggles to resolve. Moreover, MIL is prone to false alarms because it optimizes each instance independently, neglecting the temporal correlation between adjacent snippets. There is therefore a pressing need to better connect abnormal presentations with their semantics and to enable anomaly discovery across multiple temporal scales. This paper proposes a CLIP-Driven Multi-Scale Instance Learning (CMSIL) framework with two branches: Vision-Language (VL) and Multi-Scale Instance Learning (MSIL). The VL branch leverages the powerful visual concept priors of Contrastive Language-Image Pre-training (CLIP) to generate pseudo anomalies, providing suspected-anomaly cues that guide model training. The MSIL branch uses a feature pyramid to mine fine-grained temporal dependencies, applying MIL within each pyramid level to learn anomalous patterns at different temporal scales. Through the collaboration of the two branches, CMSIL handles anomalies of varying durations more effectively. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate the superior performance of our method. The code is available at https://github.com/casperZB/CMSIL.
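To make the multi-scale MIL idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation; see their repository for the real code). It assumes per-snippet anomaly scores are already available, builds a simple temporal pyramid by average-pooling adjacent snippets, and applies top-k MIL pooling at each level before averaging across levels. All function names, the pooling scheme, and the `k_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def build_pyramid(scores, levels=3):
    """Hypothetical temporal pyramid: average-pool pairs of adjacent
    snippet scores to form progressively coarser temporal scales."""
    pyramid = [scores]
    cur = scores
    for _ in range(levels - 1):
        if len(cur) % 2:                    # pad odd lengths by repeating the last snippet
            cur = np.append(cur, cur[-1])
        cur = cur.reshape(-1, 2).mean(axis=1)
        pyramid.append(cur)
    return pyramid

def mil_video_score(snippet_scores, k_ratio=0.25, levels=3):
    """Illustrative multi-scale MIL: take the mean of the top-k scores
    at each pyramid level, then average across levels."""
    per_level = []
    for level in build_pyramid(np.asarray(snippet_scores, dtype=float), levels):
        k = max(1, int(np.ceil(k_ratio * len(level))))
        topk = np.sort(level)[-k:]          # highest-scoring instances at this scale
        per_level.append(topk.mean())
    return float(np.mean(per_level))
```

A short anomaly spike raises the video-level score at the fine scale while the coarser levels temper isolated single-snippet noise, which is one plausible reading of how pyramid-level MIL could help with anomalies of varying durations.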