Diversity–Ambiguity Exploration for Weakly Supervised Video Anomaly Detection

ICLR 2026 Conference Submission20120 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Weakly supervised learning, Video anomaly detection, Anomaly Exploration
Abstract: Weakly supervised learning provides a cost-effective framework for video anomaly detection by using video-level supervision instead of costly fine-grained segment-level labels. Although contemporary methods have shown promising results on challenging real-world surveillance videos, most of them are evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC). Our work reveals that a high AUROC can coincide with very low recall at a meaningful False Positive Rate (FPR) threshold. Such models therefore have limited practical value, especially in high-stakes domains (e.g., public safety and medical diagnosis), where missing true anomalies is highly costly. This surprising phenomenon is rooted in the interplay between weak supervision and the highly imbalanced distribution of normal and abnormal segments. To tackle this key challenge in building practical video anomaly detection systems, we propose a novel dual exploration strategy that combines temporal clustering with uncertainty-based segment exploration. Temporal clustering selects diverse segments based on both semantic and temporal similarity, while uncertainty-based sampling targets low-scoring segments with high model uncertainty. This dual exploration strategy ensures that the model learns from a wide range of patterns, both diverse and ambiguous, resulting in more informed and robust decision-making and fewer false negatives. We also recommend two practical metrics to replace the commonly used AUROC score for more effective evaluation. Experiments on challenging real-world videos demonstrate that dual exploration outperforms competitive baselines on these metrics, which supports its improved practical value in real-world settings.
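The gap between AUROC and recall at a fixed FPR budget can be illustrated with a small synthetic example. The sketch below is not the authors' code or data: the score distributions, the 1% FPR budget, and the helper names `auroc` and `recall_at_fpr` are all assumptions made for illustration. The abstract does not name the two recommended metrics, so recall at a fixed FPR is used here only as a stand-in.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC as the probability that a random anomalous segment
    outscores a random normal segment (ties count as 0.5)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def recall_at_fpr(scores, labels, max_fpr):
    """Best recall over thresholds whose FPR stays within max_fpr.
    A segment is flagged when its score is >= the threshold."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    best = 0.0
    for t in np.unique(scores):
        if (neg >= t).mean() <= max_fpr:
            best = max(best, float((pos >= t).mean()))
    return best


# Hypothetical, heavily imbalanced segment scores (assumed values):
# 1000 normal segments, 50 anomalous segments.
normal_easy = np.linspace(0.00, 0.40, 950)   # confidently normal
normal_hard = np.linspace(0.90, 1.00, 50)    # normal but scored very high
anom_mid = np.linspace(0.50, 0.80, 45)       # anomalies with moderate scores
anom_high = np.linspace(0.97, 0.99, 5)       # anomalies with top scores

scores = np.concatenate([normal_easy, normal_hard, anom_mid, anom_high])
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(50, dtype=int)])

auc = auroc(scores, labels)
rec = recall_at_fpr(scores, labels, max_fpr=0.01)
print(f"AUROC = {auc:.3f}, recall at FPR <= 1% = {rec:.3f}")
```

In this toy setting the AUROC is above 0.9 while recall at the 1% FPR budget stays below 0.2: a small group of high-scoring normal segments uses up the false-positive budget before most anomalies are reached, even though nearly all anomalies outrank nearly all normal segments.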
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20120