Learning Event Completeness for Weakly Supervised Video Anomaly Detection

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A novel dual-structured framework is proposed to learn event completeness for weakly supervised video anomaly detection.
Abstract: Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, using only video-level annotations. However, the absence of dense frame-level annotations poses a significant challenge, often leading to incomplete localization in existing WS-VAD methods. To address this issue, we present LEC-VAD (Learning Event Completeness for Weakly Supervised Video Anomaly Detection), a novel framework featuring a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. In addition, we develop a novel memory bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories. This bolsters the text's expressiveness, which is crucial for advancing WS-VAD. LEC-VAD demonstrates remarkable advancements over the current state-of-the-art methods on two benchmark datasets, XD-Violence and UCF-Crime.
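To make the "anomaly-aware Gaussian mixture" idea more concrete, here is a minimal sketch: fit a two-component 1-D Gaussian mixture over per-snippet anomaly scores by EM, and use the posterior of the anomalous component to delineate event boundaries. This is a hypothetical illustration of the general technique; the paper's actual formulation, features, and parameterization may differ.

```python
import math

def fit_gmm_1d(scores, iters=50):
    """Tiny EM for a two-component 1-D Gaussian mixture (normal vs anomalous).

    Hypothetical sketch of an anomaly-aware Gaussian mixture over per-snippet
    anomaly scores; returns P(anomalous) for each snippet.
    """
    # Initialize the two component means at the score extremes.
    mu = [min(scores), max(scores)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: per-snippet responsibilities for each component.
        resp = []
        for s in scores:
            w = [pi[k] * math.exp(-(s - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            z = sum(w) or 1e-12
            resp.append([wk / z for wk in w])
        # M-step: update mixture weights, means, and (floored) variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            pi[k] = nk / len(scores)
            mu[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var[k] = max(sum(r[k] * (s - mu[k]) ** 2
                             for r, s in zip(resp, scores)) / nk, 1e-4)
    return [r[1] for r in resp]  # posterior of the anomalous component

# Toy per-snippet anomaly scores: three high-scoring snippets in the middle.
post = fit_gmm_1d([0.10, 0.12, 0.09, 0.85, 0.90, 0.88, 0.11, 0.10])
boundary = [p > 0.5 for p in post]  # contiguous True span = the event interval
```

Thresholding the posterior (rather than the raw scores) yields a soft, probabilistic boundary that adapts to the score distribution of each video.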
Lay Summary: Imagine you’re monitoring security cameras but only know that something unusual happened in a long video, not when or what exactly. This is the challenge of weakly supervised video anomaly detection (WS-VAD). Most existing methods struggle to precisely locate anomalies because they lack detailed, frame-by-frame labels. Our solution, LEC-VAD, acts like a detective trained to notice both broad "vibes" (e.g., "something’s off") and specific clues (e.g., "a fight breaks out"). LEC-VAD uses two tools:

1. Anomaly "GPS": by modeling anomalies as a mix of typical patterns (like a Gaussian “blur” of possibilities), LEC-VAD sharpens the edges of suspicious events, avoiding vague or incomplete guesses.
2. Text memory bank: it stores concise descriptions of anomalies (e.g., "explosion," "loitering") to refine how the model "talks" about events, making detection sharper.

By combining visual clues with language, LEC-VAD outperforms current methods on real-world datasets like XD-Violence and UCF-Crime. This approach could improve surveillance systems, content moderation, or even self-driving cars by spotting dangers faster and with less human effort.
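The "text memory bank" described above can be sketched as a table of per-category prototype embeddings refreshed by an exponential moving average, with cosine similarity used to match new features against categories. The class name, momentum value, and EMA update rule here are illustrative assumptions, not the paper's actual mechanism.

```python
import math

class PrototypeMemoryBank:
    """Hypothetical sketch of memory-bank prototype learning: one prototype
    vector per anomaly category, refreshed by an exponential moving average
    (EMA) of incoming features."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.momentum = momentum  # assumed EMA momentum, not from the paper
        self.protos = [[0.0] * dim for _ in range(num_classes)]

    def update(self, label, feature):
        # EMA update keeps each prototype a smoothed summary of its category.
        m = self.momentum
        self.protos[label] = [m * p + (1.0 - m) * f
                              for p, f in zip(self.protos[label], feature)]

    def classify(self, feature):
        # Return the category whose prototype is most cosine-similar.
        def cos(a, b):
            num = sum(x * y for x, y in zip(a, b))
            den = (math.sqrt(sum(x * x for x in a))
                   * math.sqrt(sum(y * y for y in b)))
            return num / den if den else 0.0
        sims = [cos(p, feature) for p in self.protos]
        return max(range(len(sims)), key=sims.__getitem__)

# Toy usage: two categories fed with axis-aligned dummy features.
bank = PrototypeMemoryBank(num_classes=2, dim=3)
for _ in range(10):
    bank.update(0, [1.0, 0.0, 0.0])  # e.g. features tied to "fighting"
    bank.update(1, [0.0, 1.0, 0.0])  # e.g. features tied to "explosion"
```

In practice the stored prototypes would be text embeddings of category descriptions, enriched over training so that short category names carry more expressive semantics.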
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Weakly-supervised video anomaly detection, multi-modal video understanding
Submission Number: 873