Abstract: Video anomaly detection (VAD) is an important application of intelligent systems, but most current research treats it as a coarse binary classification task that lacks a fine-grained understanding of abnormal video sequences. We explore a new task for video anomaly analysis, Comprehensive Video Anomaly Caption (CVAC), which aims to generate comprehensive textual captions for surveillance videos, covering scene information such as time, location, anomalous subject, and anomalous behavior. CVAC is more consistent with human understanding than VAD, but it has not been well explored. To support this research, we construct a large-scale benchmark, CVACBench. For each video clip, we provide 6 fine-grained annotations, including scene information and abnormal keywords. We also propose a new evaluation metric, Abnormal-F1 (A-F1), to evaluate the model’s caption generation performance more accurately. As a baseline, we design a method called Anomaly-Led Generating Prompting Transformer (AGPFormer). In AGPFormer, we introduce an anomaly-led language modeling mechanism (Anomaly-Led MLM, AMLM) to focus on anomalous events in videos. To achieve more efficient cross-modal semantic understanding, we design the Interactive Generating Prompting (IGP) module and the Scene Alignment Prompting (SAP) module, which explore the divide between the video and text modalities from multiple perspectives and improve the model’s performance in understanding and reasoning about the complex semantics of videos. We conduct experiments on CVACBench using traditional caption metrics and the proposed metric, and the results demonstrate the effectiveness of AGPFormer for anomaly captioning.
External IDs: dblp:journals/tmm/BaoLJLLLLWC25