Audio-visual mutual learning for Weakly Supervised Violence Detection

Jialiang Cheng, Chao Sun, Jincai Chen, Ping Lu

Published: 2023, Last Modified: 06 Mar 2025, Venue: ICISE 2023, License: CC BY-SA 4.0

Abstract: With the development of smart surveillance systems, multimodal violence video data provides an ample data source for such systems. At present, intelligent violence detection faces two main challenges. The first, which arises when applying weakly supervised learning to violence detection, is accurately identifying the normal segments within abnormal videos; the second is making full use of both the visual and audio features of videos. We therefore explore methods for fusing visual and audio information together with temporal information. We propose a novel neural network with three parts: 1) a co-attention module that fuses audio and video features, with an LSTM to extract temporal information, 2) mutual learning branches with a threshold to generate high-quality pseudo labels, and 3) a simple and effective post-processing method to ensure the continuity of the predicted results. Our experimental results show that the proposed model exceeds the existing state-of-the-art models on the XD-Violence dataset by 1.92% in AP.
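
The abstract does not give the thresholding rule or the post-processing step in detail, so the following is only a minimal sketch of what parts 2) and 3) might look like. Everything in it is an assumption made for illustration: the function names, the two-sided thresholds `low` and `high`, and the use of a centered moving average for continuity are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def threshold_pseudo_labels(
    scores: List[float], low: float = 0.3, high: float = 0.7
) -> List[Optional[int]]:
    """Convert one branch's per-segment scores into pseudo labels.

    Assumed rule (not from the paper): confident scores become hard
    labels for the other branch, uncertain ones are ignored (None).
    """
    labels: List[Optional[int]] = []
    for s in scores:
        if s >= high:
            labels.append(1)      # confidently violent segment
        elif s <= low:
            labels.append(0)      # confidently normal segment
        else:
            labels.append(None)   # too uncertain to supervise with
    return labels


def smooth_scores(scores: List[float], window: int = 3) -> List[float]:
    """Centered moving average over segment scores.

    Assumed stand-in for the post-processing step: isolated spikes are
    damped so that predictions vary more smoothly over time. Windows
    are clipped at the sequence boundaries.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed: List[float] = []
    for i in range(len(scores)):
        start = max(0, i - half)
        end = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[start:end]) / (end - start))
    return smoothed
```

In a mutual learning setup of this kind, each branch would be trained on the pseudo labels produced from the other branch's scores, and segments labeled `None` would be left out of the loss. The smoothing would be applied only at inference time.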