Abstract: Audio-visual fusion methods are widely used for violence detection because they integrate complementary information from both modalities and can significantly improve accuracy. However, designing high-quality multimodal fusion networks depends heavily on expert experience and requires substantial effort. To alleviate this challenge, we propose a novel method named Violence-MFAS, which uses multimodal fusion architecture search (MFAS) to automatically design promising multimodal fusion architectures for violence detection. To further enable the model to focus on important information, we design a new search space. Specifically, multilayer neural networks based on attention mechanisms are constructed to capture intricate spatio-temporal relationships and extract comprehensive multimodal representations. Finally, extensive experiments are conducted on XD-Violence, a commonly used large-scale, multi-scene audio-visual dataset. The results demonstrate that our method outperforms state-of-the-art methods with fewer parameters.