Abstract: Multimodal sentiment analysis (MSA) aims to accurately predict a user's emotional tendency by integrating multimodal data, such as the text, audio, and visual information that users post. However, most prior studies treat these modalities equally and thus fail to account for the informational differences between them. In fact, the text modality usually carries richer semantics for expressing emotion, whereas the audio and visual modalities contain more redundant and even noisy information. Although some works have highlighted the key role of the text modality, they still fail to fully exploit the non-text modalities, leading to sub-optimal MSA performance. To this end, we propose a novel Text-centric Bidirectional Modality Enhancement Network (TB-MEN), which recognizes the core role of text while fully exploiting the audio and visual modalities. First, given text features extracted by BERT and non-text features extracted by LSTMs, we develop a trimodal multi-scale bottleneck fusion (TMBF) module to capture semantically enhanced text features; this is achieved through a bottleneck fusion mechanism that transfers information from the audio and visual modalities to the text modality. Furthermore, to enhance the non-text modalities, a text-dominant subspace alignment (TDSA) module employs a sparse subspace alignment strategy that sparsely maps non-text features into the text subspace to aggregate semantic information. Experimental results on several public datasets, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, show that our model achieves significant gains over state-of-the-art methods. For example, on CMU-MOSI, TB-MEN improves Acc-2, Acc-5, Acc-7, and F1 by 0.45%, 2.04%, 2.62%, and 0.40%, respectively, over the second-best model.
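To make the bottleneck fusion idea concrete, below is a minimal PyTorch sketch of attention-based bottleneck fusion: a small set of learned bottleneck tokens first attends to the audio and visual streams, and the text stream then attends to those compressed tokens to obtain enhanced text features. This is an illustration of the general mechanism only; the class and parameter names (`BottleneckFusion`, `num_bottleneck`, etc.) are hypothetical, and the paper's actual TMBF module is multi-scale and may differ in detail.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Illustrative sketch of bottleneck fusion (not the paper's exact TMBF):
    learned bottleneck tokens aggregate audio/visual evidence, then the text
    stream reads from the bottleneck, keeping text as the dominant modality."""

    def __init__(self, dim=128, num_bottleneck=4, num_heads=4):
        super().__init__()
        # Learned bottleneck tokens shared across the batch (assumption).
        self.bottleneck = nn.Parameter(torch.randn(num_bottleneck, dim))
        self.read_av = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text, audio, visual):
        # text / audio / visual: (batch, seq_len, dim), already projected
        # into a shared feature space (assumption for this sketch).
        b = text.size(0)
        tokens = self.bottleneck.unsqueeze(0).expand(b, -1, -1)
        # Step 1: bottleneck tokens attend over concatenated audio + visual,
        # compressing the non-text streams into a few tokens.
        av = torch.cat([audio, visual], dim=1)
        tokens, _ = self.read_av(tokens, av, av)
        # Step 2: text queries the compressed bottleneck, so only the
        # distilled non-text information flows into the text features.
        enhanced, _ = self.write_text(text, tokens, tokens)
        return text + enhanced  # residual connection keeps text central


# Toy usage with random features in a shared 128-d space.
t, a, v = (torch.randn(2, n, 128) for n in (20, 50, 50))
out = BottleneckFusion()(t, a, v)
print(out.shape)  # torch.Size([2, 20, 128])
```

The narrow bottleneck acts as an information filter: because the audio and visual streams can only influence the text features through a handful of tokens, redundant or noisy non-text content is compressed away before fusion.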
DOI: 10.1109/taslpro.2026.3660469