Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning

Published: 01 Jan 2024 · Last Modified: 14 May 2025 · IEEE Signal Process. Lett. 2024 · License: CC BY-SA 4.0
Abstract: Violent video recognition aims to determine whether a video contains violent behavior. Because violence is often accompanied by both visual and audio anomalies, multimodal approaches have long played an important role in this field. However, existing methods make insufficient use of self-supervised audio-visual semantic cues and cross-modal correlation, which limits the representational capacity of the network, and they generalize poorly because available violent video datasets are scarce. To address these issues, we propose a violent action recognition model based on global-local visual and audio contrastive learning. The model introduces global and local contrastive objectives that achieve multi-grained audio-visual semantic alignment and exploit cross-modal correlation for violent video recognition. Experimental results show that the proposed model improves the state of the art by 2.31% on the VSD dataset, 0.71% on the Violent-Flows dataset, and 1.43% on the VCD dataset.
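To make the global-local objective concrete, below is a minimal sketch of how a combined global and local audio-visual contrastive loss could be formed, assuming segment-level visual and audio features of shape (batch, segments, dim). The function names, tensor layout, temperature, and weighting factor `alpha` are illustrative assumptions and are not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) cross-modal similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched audio-visual pairs lie on the diagonal; all others act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def global_local_contrastive_loss(v_local, a_local, alpha=0.5, temperature=0.07):
    """
    v_local, a_local: per-segment visual / audio features, shape (B, T, D),
    where the T segments come from the same video. Hypothetical layout --
    the paper's feature extractors and granularity may differ.
    """
    B, T, D = v_local.shape
    # Global alignment: pool segments into one video-level embedding per modality.
    loss_global = info_nce(v_local.mean(dim=1), a_local.mean(dim=1), temperature)
    # Local alignment: treat every segment as its own positive audio-visual pair.
    loss_local = info_nce(v_local.reshape(B * T, D),
                          a_local.reshape(B * T, D), temperature)
    return alpha * loss_global + (1 - alpha) * loss_local
```

In this sketch the global term encourages video-level audio-visual agreement, while the local term enforces finer segment-level correspondence; how the two terms are actually balanced and which negatives are sampled would follow the paper's own design.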