Cog-VADU: A Training-Free Cognitive Reasoning Framework for Video Anomaly Detection and Understanding
Abstract: Video Anomaly Detection (VAD) aims to temporally localize abnormal events in videos.
Most existing approaches rely on dataset-specific training and curated annotations, limiting generalization in open-set scenarios.
Recent zero-shot methods based on Large Vision-Language Models (LVLMs) alleviate this dependency but often lack temporal continuity and structured reasoning. We propose \textbf{Cog-VADU}, a fully training-free framework that reformulates VAD as a sequential cognitive reasoning task.
Cog-VADU introduces \emph{Chain-of-Anomaly Detection Thought Prompting} (CoADTP), which unrolls an LVLM into a recurrent reasoning chain across video segments.
By propagating structured rationales over time, the model maintains an implicit temporal memory, enabling robust discrimination between complex anomalies and high-motion normal activities. To improve reliability, we further design a cross-modal re-ranking stage that aligns textual rationales with visual embeddings, enforcing semantic consistency and temporal coherence to refine and stabilize predictions. Extensive experiments on multiple public VAD benchmarks demonstrate that Cog-VADU achieves competitive zero-shot performance. Moreover, cross-model evaluations show that CoADTP consistently enhances reasoning-based anomaly detection in a model-agnostic manner, providing interpretable and generalizable anomaly understanding for real-world applications.
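To make the abstract's description of CoADTP concrete, the following is a minimal sketch of unrolling an LVLM into a recurrent reasoning chain over video segments. It assumes a generic `lvlm_query(frames, prompt)` callable returning a (rationale, score) pair; the prompt wording and helper names are illustrative assumptions, not the paper's actual implementation.

```python
def coadtp(segments, lvlm_query):
    """Run an LVLM as a recurrent reasoning chain over video segments,
    feeding each step's rationale back in as implicit temporal memory.

    Note: `lvlm_query` is a hypothetical interface assumed for this sketch.
    """
    rationale = "No prior observations."  # textual state carried across steps
    scores = []
    for frames in segments:  # temporally ordered video segments
        prompt = (
            "Prior reasoning about this video:\n"
            f"{rationale}\n\n"
            "Describe the current segment, reason step by step about whether "
            "an anomalous event occurs, and give an anomaly score in [0, 1]."
        )
        # The new rationale replaces the old one, so temporal context is
        # propagated implicitly through text rather than model weights.
        rationale, score = lvlm_query(frames, prompt)
        scores.append(score)
    return scores
```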
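Likewise, the cross-modal re-ranking stage could be sketched as below, assuming rationales and segments are embedded in a shared text-image space (e.g., a CLIP-style encoder). The linear fusion rule, the weight `alpha`, and the moving-average window are assumptions standing in for the paper's actual re-ranking formula.

```python
import numpy as np

def cross_modal_rerank(scores, rationale_embs, segment_embs, alpha=0.5, window=3):
    """Re-weight per-segment anomaly scores by rationale/segment agreement
    (cosine similarity in a shared embedding space), then smooth the result
    with a moving average for temporal coherence.

    scores:         (T,)  per-segment anomaly scores from the reasoning chain
    rationale_embs: (T,D) embeddings of the textual rationales
    segment_embs:   (T,D) embeddings of the corresponding video segments
    """
    t = rationale_embs / np.linalg.norm(rationale_embs, axis=1, keepdims=True)
    v = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    consistency = (t * v).sum(axis=1)  # per-segment cosine similarity
    # Fuse the LVLM score with the cross-modal consistency signal (assumed
    # linear fusion), then enforce temporal coherence by local averaging.
    fused = alpha * np.asarray(scores) + (1.0 - alpha) * consistency
    return np.convolve(fused, np.ones(window) / window, mode="same")
```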
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 8561