CFMISA: Cross-Modal Fusion of Modal Invariant and Specific Representations for Multimodal Sentiment Analysis
Abstract: Multimodal sentiment analysis aims to identify and understand sentiment using data from different information sources. However, effectively fusing different modalities is challenging due to heterogeneity, information redundancy, and distributional differences between modalities. Previous approaches handle modal heterogeneity by projecting multiple modalities into a common latent space, but they ignore the valuable unique information within each modality and thus fail to capture each modality's rich semantic information from different perspectives. To address this problem, we propose a framework called CFMISA that effectively learns the commonalities and idiosyncrasies of different modal data and improves the robustness of sentiment analysis. Specifically, each modality is first mapped to two spaces: a public subspace and a private subspace. The public subspace contains modality-invariant representations, while the private subspace contains modality-specific representations. We then introduce the cross-modal fusion module MCROSS to facilitate effective interaction and fusion between the modality-invariant and modality-specific representations, achieving commonality fusion while preserving the diversity of each individual modality. As a result, CFMISA fully exploits the useful information in each modality, achieving more effective fusion and reducing redundant information. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed CFMISA, which outperforms most existing methods.
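To make the two-subspace design concrete, the following is a minimal PyTorch sketch of the projection-and-fusion idea described in the abstract. It is illustrative only: the class name `CFMISASketch`, the layer sizes, the tanh projections, and the use of cross-attention to stand in for the MCROSS module are all assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class CFMISASketch(nn.Module):
    """Hypothetical sketch of the public/private subspace idea in CFMISA.

    A single public projector is shared across modalities so that the
    modality-invariant features land in a common latent space; each
    modality keeps its own private projector for modality-specific
    features. Cross-attention (an assumption, standing in for MCROSS)
    then fuses the invariant and specific representations.
    """

    def __init__(self, in_dim: int, hidden: int, n_heads: int = 4):
        super().__init__()
        # Shared (public) projector: one instance reused by all modalities.
        self.public = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        # Private projectors: one per modality.
        self.private_text = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.private_audio = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        # Cross-attention as a placeholder for the MCROSS fusion module.
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Modality-invariant representations from the shared public subspace.
        invariant = torch.cat([self.public(text), self.public(audio)], dim=1)
        # Modality-specific representations from the private subspaces.
        specific = torch.cat(
            [self.private_text(text), self.private_audio(audio)], dim=1
        )
        # Queries come from the invariant features; keys/values from the
        # specific ones, fusing commonality while keeping per-modality
        # diversity available to the model.
        fused, _ = self.attn(invariant, specific, specific)
        return self.norm(invariant + fused)


if __name__ == "__main__":
    batch, seq, dim, hidden = 2, 8, 64, 32
    text = torch.randn(batch, seq, dim)   # placeholder text features
    audio = torch.randn(batch, seq, dim)  # placeholder audio features
    model = CFMISASketch(dim, hidden)
    out = model(text, audio)
    print(out.shape)  # torch.Size([2, 16, 32])
```

In this sketch, sharing one public projector across modalities is what forces the invariant features into a common latent space; the per-modality private projectors are what let each modality retain its unique information before fusion.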