Abstract: Weakly-supervised video anomaly detection (WS-VAD) aims to identify fine-grained anomalies from sparse video-level labels. It has gained increasing attention in recent years due to applications such as disaster warning and public security. Recent studies typically formulate WS-VAD as a multi-instance learning (MIL) problem. However, they neglect the instance creation process and simply apply a uniform temporal pooling (UTP) operation to obtain the training instances, leading to severe anomaly contamination and dilution. In this paper, we emphasize the importance of the instance modeling procedure and propose two simple yet effective modules, i.e., the dynamic segment merging (DSM) module and the retrieval-augmented anomaly restoration (RA2R) module, to tackle the problem at the segment level and the feature level, respectively. We equip various state-of-the-art WS-VAD models with the proposed modules and conduct thorough experiments on challenging datasets, e.g., UCF-Crime and XD-Violence. The results demonstrate that the proposed modules bring consistent performance improvements and establish a new state of the art.
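To make the dilution effect of uniform temporal pooling concrete, the following is a minimal sketch, not the authors' implementation. It assumes UTP means splitting a video's snippet features into a fixed number of equal-length temporal bins and averaging within each bin; the abstract does not specify the exact operation. The function name `uniform_temporal_pooling` and the toy one-dimensional "features" are hypothetical, and the DSM and RA2R modules are not shown.

```python
from typing import List


def uniform_temporal_pooling(
    features: List[List[float]], num_segments: int
) -> List[List[float]]:
    """Average snippet features into `num_segments` equal-length temporal bins.

    Assumed form of UTP for illustration only; hypothetical helper.
    """
    num_snippets = len(features)
    if num_snippets == 0 or num_segments <= 0:
        raise ValueError("need at least one snippet and one segment")
    dim = len(features[0])
    pooled = []
    for i in range(num_segments):
        # Evenly spaced bin boundaries over the snippet axis.
        start = (i * num_snippets) // num_segments
        end = ((i + 1) * num_snippets) // num_segments
        if end <= start:
            # Fewer snippets than segments: fall back to a single snippet.
            start = min(start, num_snippets - 1)
            end = start + 1
        window = features[start:end]
        pooled.append(
            [sum(row[d] for row in window) / len(window) for d in range(dim)]
        )
    return pooled


# Toy video: 8 snippets, one scalar "anomaly evidence" value per snippet.
# Only snippet 2 is anomalous.
snippets = [[0.0], [0.0], [1.0], [0.0], [0.0], [0.0], [0.0], [0.0]]
pooled = uniform_temporal_pooling(snippets, 2)
print(pooled)  # [[0.25], [0.0]]
```

In this toy case the single anomalous snippet is averaged with three normal ones, so its evidence drops from 1.0 to 0.25 in the resulting instance. This is one way to read the "contamination and dilution" the abstract attributes to UTP.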
External IDs: doi:10.1109/tcsvt.2025.3546766