Modality-Free Violence Detection via Cross-Modal Causal Attention and Feature Distillation

Jiaxu Leng, Zhanjie Wu, Mengjingcheng Mo, Mingpi Tan, Shuang Li, Xinbo Gao

Published: 2024, Last Modified: 06 Nov 2025, Venue: ICME 2024, License: CC BY-SA 4.0

Abstract: We propose Modality-Free Violence Detection (MFVD), a framework that captures the causal relationships among multimodal cues and maintains stable performance even when audio information is absent. Specifically, we design a Cross-Modal Causal Attention mechanism (CCA) to handle modality asynchrony: rather than merely computing correlation scores between audio and visual features, CCA uses relative temporal distance and semantic correlation to obtain causal attention between the two modalities. To ensure the framework works well when the audio modality is missing, we further design a Cross-Modal Feature Distillation module (CFD), which leverages the common parts of the fused features obtained from CCA to guide the enhancement of visual features. Experimental results on the XD-Violence dataset show that the proposed method surpasses state-of-the-art methods in both the vision-only and audio-visual settings.
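The abstract does not give the CCA formulation, so the following is only an illustrative sketch of the general idea of combining semantic correlation with relative temporal distance in a cross-modal attention score. Everything specific here is an assumption rather than the authors' method: the function name `temporal_cross_attention`, the linear distance penalty, the scale `tau`, and the residual fusion are all hypothetical choices.

```python
import math


def temporal_cross_attention(visual, audio, tau=2.0):
    """Attend from each visual snippet to all audio snippets.

    visual, audio: lists of feature vectors (same dimension), one per time step.
    tau: hypothetical scale of the temporal-distance penalty (an assumption,
         not a parameter taken from the paper).
    Returns (fused, weights), where weights[i][j] is the attention that
    visual step i places on audio step j.
    """
    d = len(visual[0])
    fused, weights = [], []
    for i, q in enumerate(visual):
        # Semantic term: scaled dot product between visual query and audio key.
        # Temporal term: penalize audio snippets that are far from step i.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) - abs(i - j) / tau
            for j, k in enumerate(audio)
        ]
        # Numerically stable softmax over the audio time steps.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        weights.append(w)
        # Attention-weighted audio context, added to the visual feature.
        ctx = [sum(w[j] * audio[j][c] for j in range(len(audio))) for c in range(d)]
        fused.append([q[c] + ctx[c] for c in range(d)])
    return fused, weights


# Identical audio features at every step, so only temporal distance
# differentiates the attention weights within a row.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
fused, weights = temporal_cross_attention(visual, audio)
```

In this toy input the semantic scores are equal within each row, so the weights fall off with temporal distance alone. With real features the two terms would trade off, letting a semantically matching but slightly misaligned audio snippet still receive attention.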