TsAFN: A two-stage adaptive fusion network for multimodal sentiment analysis

Published: 01 Jan 2025, Last Modified: 28 Oct 2025 · Appl. Intell. 2025 · License: CC BY-SA 4.0
Abstract: Multimodal sentiment analysis (MSA) provides a more accurate understanding of human emotional states than unimodal analysis. However, each modality is limited in how well it can express emotion semantically, which leads to inconsistency in how strongly each unimodal signal influences the sentiment polarity of the fused modality, as well as to sentiment polarity biases arising from interactions among the modalities. Both effects reduce the accuracy of MSA. To address this problem, we propose a two-stage adaptive fusion network (TsAFN). The first stage is an adaptive fusion network based on the joint representation of modal features. Features are extracted with BERT and LSTM networks. An importance-metric adaptive benchmark is introduced, from which a feature planning method jointly represents the multimodal features as a fused modal feature and automatically balances the importance of each unimodal influence on the fused sentiment polarity. The second stage is an adaptive fusion network based on modal interaction. A distance-metric adaptive benchmark is defined, and a representation reconstruction method built on it accounts for inter-modal interactions: the relationships among the modalities and their sentiment polarity biases are adjusted to reconstruct the unimodal sentiment polarities and a more accurate representation of the fused modality. Finally, a loss function is defined and the model is trained on three datasets: MOSI, MOSEI, and CH-SIMS. Comparative experiments show that TsAFN achieves better accuracy in MSA.
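To make the described pipeline concrete, the sketch below shows one plausible way the BERT/LSTM feature extraction and the two fusion stages could be wired together in PyTorch. All module and attribute names (TextEncoder, SeqEncoder, TwoStageFusion, the softmax importance weighting, and the mean-based interaction term) are assumptions for illustration, not the authors' implementation of the importance-metric or distance-metric benchmarks.

```python
# Minimal, illustrative sketch of a two-stage multimodal fusion pipeline.
# Assumes torch and the Hugging Face transformers library are installed.
import torch
import torch.nn as nn
from transformers import BertModel


class TextEncoder(nn.Module):
    """Text features from BERT (assumed: pooled [CLS] output, then projected)."""
    def __init__(self, dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.pooler_output)             # (B, dim)


class SeqEncoder(nn.Module):
    """Audio/visual features from an LSTM over frame-level sequences."""
    def __init__(self, in_dim, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, dim, batch_first=True)

    def forward(self, x):                                # x: (B, T, in_dim)
        _, (h, _) = self.lstm(x)
        return h[-1]                                     # (B, dim)


class TwoStageFusion(nn.Module):
    """Stage 1: importance-weighted joint fusion of unimodal features.
       Stage 2: interaction-aware adjustment of the fused representation."""
    def __init__(self, dim=128):
        super().__init__()
        self.importance = nn.Linear(dim, 1)              # per-modality importance score
        self.reconstruct = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, 1)                    # sentiment polarity regression

    def forward(self, feats):                            # feats: list of (B, dim) tensors
        stacked = torch.stack(feats, dim=1)              # (B, M, dim)
        w = torch.softmax(self.importance(stacked), 1)   # (B, M, 1) importance weights
        fused = (w * stacked).sum(dim=1)                 # stage-1 fused feature
        # Stage 2: adjust the fused feature with an inter-modal interaction term
        # (a simple mean here, standing in for the distance-metric reconstruction).
        interaction = stacked.mean(dim=1)
        fused = self.reconstruct(torch.cat([fused, interaction], dim=-1))
        return self.head(fused)                          # (B, 1) predicted polarity
```

In this reading, the softmax over per-modality importance scores plays the role of equalizing how much each unimodal feature contributes to the fused representation, while the reconstruction MLP plays the role of the interaction-based second stage; the paper's actual benchmarks and loss terms are defined in the full text.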