JTMA: Joint Multimodal Feature Fusion and Temporal Multi-head Attention for Humor Detection

Published: 01 Jan 2023 · Last Modified: 22 May 2025 · MuSe @ ACM Multimedia 2023 · CC BY-SA 4.0
Abstract: In this paper, we propose a model named Joint multimodal feature fusion and Temporal Multi-head Attention (JTMA) to address the MuSe-Humor sub-challenge of the Multimodal Sentiment Analysis Challenge 2023. The goal of the MuSe-Humor sub-challenge is to predict whether humor occurs in a given dataset that contains data from multiple modalities (e.g., text, audio, and video). Cross-cultural testing introduces a new challenge that distinguishes this year's task from previous editions. To address these problems, the proposed JTMA model first uses a 1-D CNN to aggregate temporal information within each unimodal feature sequence. Inter- and intra-modality interactions are then modeled by the multimodal feature encoder module. Finally, we integrate the high-level representations learned from the multiple modalities to predict humor accurately. The effectiveness of the proposed model is demonstrated by experimental results on the official test set: our model achieves an AUC of 0.8889, surpassing all other participants in the competition and securing the Top 1 ranking.
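The pipeline described in the abstract (per-modality 1-D CNN temporal aggregation, multi-head attention for intra- and inter-modality interactions, and fusion of the resulting high-level representations for humor prediction) can be sketched roughly as follows. This is a minimal illustrative sketch assuming a PyTorch implementation; the module names, dimensions, attention layout, and pooling choices are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a JTMA-style pipeline (illustrative, not the official implementation).
import torch
import torch.nn as nn


class TemporalConv(nn.Module):
    """Aggregates temporal context within one modality via a 1-D convolution."""

    def __init__(self, in_dim: int, hidden_dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, hidden_dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


class JTMASketch(nn.Module):
    """Hypothetical joint fusion + temporal multi-head attention model."""

    def __init__(self, dims: dict, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.convs = nn.ModuleDict({m: TemporalConv(d, hidden_dim) for m, d in dims.items()})
        # Intra-modality self-attention, one block per modality.
        self.self_attn = nn.ModuleDict(
            {m: nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True) for m in dims}
        )
        # Inter-modality attention over the concatenated sequences of all modalities.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: modality name -> tensor of shape (batch, time, feature_dim)
        encoded = []
        for m, x in inputs.items():
            h = self.convs[m](x)               # temporal aggregation within the modality
            h, _ = self.self_attn[m](h, h, h)  # intra-modality interaction
            encoded.append(h)
        joint = torch.cat(encoded, dim=1)                # stack all modalities along time
        joint, _ = self.cross_attn(joint, joint, joint)  # inter-modality interaction
        pooled = joint.mean(dim=1)                       # integrate high-level representations
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)  # humor probability


if __name__ == "__main__":
    # Feature dimensions here are placeholders for typical text/audio/video embeddings.
    model = JTMASketch({"text": 768, "audio": 88, "video": 512})
    batch = {
        "text": torch.randn(2, 20, 768),
        "audio": torch.randn(2, 20, 88),
        "video": torch.randn(2, 20, 512),
    }
    print(model(batch).shape)  # torch.Size([2])
```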