GLoMo: Global-Local Modal Fusion for Multimodal Sentiment Analysis

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Multimodal Sentiment Analysis (MSA) has witnessed remarkable progress and gained increasing attention in recent years, thanks to advances in deep learning. However, current MSA methodologies primarily rely on global representations extracted from different modalities, such as the mean of all token representations, to construct sophisticated fusion networks. These approaches often overlook the valuable details present in local representations, i.e., fused representations of several consecutive tokens. Moreover, integrating multiple local representations and fusing local with global information pose significant challenges. To address these limitations, we propose the Global-Local Modal (GLoMo) fusion framework. This framework comprises two essential components: (i) modality-specific mixture-of-experts layers that integrate diverse local representations within each modality, and (ii) a global-guided fusion module that effectively combines global and local representations. The former leverages specialized expert networks to automatically select and integrate crucial local representations from each modality, while the latter ensures that global information is preserved during fusion. We extensively evaluate GLoMo on multiple datasets spanning multimodal sentiment analysis, multimodal humor detection, and multimodal emotion recognition. Empirical results demonstrate that GLoMo outperforms existing state-of-the-art models, validating the effectiveness of the proposed framework.
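To make the two components concrete, below is a minimal PyTorch sketch of how they might be wired together. The abstract does not specify implementation details, so the window size, expert count, pooling operator, and attention-based fusion here are illustrative assumptions, and the class names LocalMoE and GlobalGuidedFusion are hypothetical rather than the authors' actual modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMoE(nn.Module):
    """Modality-specific mixture of experts over local representations.

    Local representations are formed by average-pooling windows of
    consecutive tokens; a softmax gate then weights the expert outputs.
    Window size, expert count, and gating are illustrative choices.
    """
    def __init__(self, dim, num_experts=4, window=3):
        super().__init__()
        self.window = window
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, tokens):  # tokens: (B, T, D)
        # Fuse each window of consecutive tokens into one local representation.
        local = F.avg_pool1d(tokens.transpose(1, 2), self.window, stride=1).transpose(1, 2)  # (B, T', D)
        weights = F.softmax(self.gate(local), dim=-1)                      # (B, T', E)
        expert_out = torch.stack([e(local) for e in self.experts], dim=-1) # (B, T', D, E)
        # Gate selects and integrates the crucial local representations.
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)             # (B, T', D)

class GlobalGuidedFusion(nn.Module):
    """Fuse local representations under guidance from the global one."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, global_rep, local):  # global_rep: (B, D); local: (B, T', D)
        q = global_rep.unsqueeze(1)                 # global representation as query
        fused, _ = self.attn(q, local, local)       # attend to local details
        return fused.squeeze(1) + global_rep        # residual preserves global info

# Illustrative usage with random features for one modality:
# tokens = torch.randn(8, 20, 64)
# moe, fuse = LocalMoE(64), GlobalGuidedFusion(64)
# out = fuse(tokens.mean(dim=1), moe(tokens))  # (8, 64)
```

The residual connection in the fusion step is one simple way to realize the abstract's claim that global information is preserved during fusion; the paper's actual mechanism may differ.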
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Multimodal Sentiment Analysis (MSA) has emerged as a vital field of study, aiming to decipher human emotions through the integration of diverse signals. This research aligns closely with the scope of the ACM MM conference, which emphasizes Multimodal Fusion and Multimedia Interpretation. GLoMo, our proposed framework, contributes to the ACM MM domain by addressing the challenges inherent in multimodal sentiment analysis. It combines modality-specific local representations with global representations, thereby enriching the interpretability and accuracy of emotion recognition across diverse multimedia content. The mixture-of-experts layers within GLoMo effectively capture the intricate details of local information from each modality, which traditional global-centric approaches often overlook, ensuring a more nuanced understanding of sentiment that accounts for the context and subtleties of multimodal cues. Moreover, the global-guided fusion module in GLoMo integrates local and global features, enhancing the robustness of sentiment prediction. By outperforming state-of-the-art models in multimodal emotion recognition (MER), multimodal sentiment analysis (MSA), and multimodal humor detection (MHD), GLoMo sets a new benchmark for multimedia processing. The framework's improved performance represents a practical advance in the efficient processing of multimedia data.
Supplementary Material: zip
Submission Number: 4508