Video anomaly detection has garnered widespread attention in industry and academia in recent years due to its significant role in public security. However, many existing methods overlook the influence of scene context on anomaly detection, simply labeling the occurrence of certain actions or objects as anomalous. In reality, scene context plays a crucial role in determining what counts as an anomaly: running on a highway is anomalous, while running on a playground is normal. Understanding the scene is therefore essential for effective anomaly detection. In this work, we address the challenge of scene-dependent weakly supervised video anomaly detection by decoupling scenes. Specifically, we propose a novel text-driven scene-decoupled (TDSD) framework, consisting of a TDSD module (TDSDM) and fine-grained visual augmentation (FVA) modules. The TDSDM extracts semantic information from scenes, while the FVA modules assist in fine-grained visual augmentation. We validate the effectiveness of our approach by constructing two scene-dependent datasets, and we achieve state-of-the-art results on scene-agnostic datasets as well. Code is available at https://github.com/shengyangsun/TDSD.
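To illustrate the core idea of scene-dependent scoring, the sketch below shows one minimal way text-derived scene semantics could condition per-frame action scores. The abstract does not specify the fusion mechanism, so every shape, variable name, and the additive fusion rule here are illustrative assumptions, not the authors' implementation; the toy arrays stand in for CLIP-style embeddings.

```python
import numpy as np

# Hypothetical sketch: scene semantics (a text embedding) modulate visual
# features before matching against candidate action prompts. All details
# below are assumptions for illustration only.

rng = np.random.default_rng(0)
D = 512  # shared embedding dimension (assumed)

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def scene_conditioned_scores(visual_feats, scene_text_emb, action_text_embs):
    """Score each frame against candidate action prompts, conditioned on
    the scene embedding via simple additive fusion (an assumption)."""
    fused = l2_normalize(visual_feats + scene_text_emb)  # (T, D)
    actions = l2_normalize(action_text_embs)             # (K, D)
    return fused @ actions.T                             # (T, K) cosine scores

# Toy inputs standing in for encoder outputs.
visual = l2_normalize(rng.normal(size=(8, D)))   # T=8 frame features
scene = l2_normalize(rng.normal(size=(D,)))      # e.g. embedding of "a highway"
acts = rng.normal(size=(2, D))                   # e.g. ["running", "walking"]

scores = scene_conditioned_scores(visual, scene, acts)
print(scores.shape)  # per-frame score for each action prompt
```

Because the same action embedding ("running") yields different fused scores under different scene embeddings ("highway" vs. "playground"), a downstream detector can treat the identical action as anomalous in one scene and normal in another, which is the behavior the abstract motivates.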