Multimodal-Attention Fusion for the Detection of Questionable Content in Videos

Arnold Morales, Elaheh Baharlouei, Thamar Solorio, Hugo Jair Escalante

Published: 2024, Last Modified: 13 Nov 2024, MCPR 2024, CC BY-SA 4.0

Abstract: We address the problem of filtering questionable content from videos, focusing in particular on the detection of comic mischief. Attention-based models have been proposed for this problem, mostly relying on hierarchical cross-attention (HCA) to fuse multimodal information. While such solutions achieve competitive performance, it is unclear whether the hierarchical mechanism is the best choice for this type of model. In this paper, we explore an alternative mechanism called parallel cross-attention (ParCA). We also propose using gated multimodal units (GMUs) to fuse multiple multimodal attention mechanisms, in addition to the traditional concatenation. Experimental results show that combining parallel cross-attention with GMUs considerably improves on the performance of the HCA-based reference model.
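
To make the contrast between gated fusion and concatenation concrete, the following is a minimal sketch of a two-input gated multimodal unit in its commonly cited bimodal form. The abstract does not specify how the GMU is configured in this work, so everything here is an assumption for illustration: the function names (`gmu_fuse`, `matvec`, `random_matrix`), the dimensions, and the random, untrained weights are hypothetical, and the two inputs merely stand in for the outputs of two attention branches.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gmu_fuse(x_a, x_b, W_a, W_b, W_z):
    """Bimodal gated fusion (illustrative, not the authors' implementation).

    h_a = tanh(W_a x_a), h_b = tanh(W_b x_b)
    z   = sigmoid(W_z [x_a; x_b])
    out = z * h_a + (1 - z) * h_b   (element-wise)
    """
    h_a = [math.tanh(v) for v in matvec(W_a, x_a)]
    h_b = [math.tanh(v) for v in matvec(W_b, x_b)]
    # The gate sees both inputs and decides, per hidden unit, how much
    # each projected input contributes to the fused representation.
    z = [sigmoid(v) for v in matvec(W_z, x_a + x_b)]
    return [g * a + (1.0 - g) * b for g, a, b in zip(z, h_a, h_b)]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for learned ones."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_a, d_b, d_h = 6, 4, 5  # assumed input and hidden sizes

x_a = [rng.gauss(0.0, 1.0) for _ in range(d_a)]  # stand-in for one attention output
x_b = [rng.gauss(0.0, 1.0) for _ in range(d_b)]  # stand-in for another attention output

W_a = random_matrix(d_h, d_a, rng)
W_b = random_matrix(d_h, d_b, rng)
W_z = random_matrix(d_h, d_a + d_b, rng)

fused = gmu_fuse(x_a, x_b, W_a, W_b, W_z)  # length d_h
concatenated = x_a + x_b                   # length d_a + d_b, no gating
```

Under these assumptions, concatenation simply stacks the two vectors and leaves any weighting to later layers, whereas the gated unit produces a fixed-size vector in which each component is a learned convex combination of the two projected inputs.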