Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning

Published: 2025, Last Modified: 10 Nov 2025, Appl. Soft Comput. 2025, CC BY-SA 4.0
Abstract (Highlights):
• We introduce WavFusion, a multimodal speech emotion recognition model that builds on the strengths of wav2vec 2.0 and integrates textual and visual modalities to improve the performance of audio-based emotion recognition.
• To reduce redundant information during modality fusion, we integrate a specially designed gated cross-modal attention mechanism into the wav2vec 2.0 model. Additionally, we employ multimodal homogeneous feature discrepancy learning to enhance the model's discriminative ability.
• Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed approach: WavFusion outperforms current state-of-the-art methods in speech emotion recognition.
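The highlights describe the gated cross-modal attention only at a high level, and the page does not include code. As a rough illustration of how a learned gate can limit redundant information when fusing another modality into audio features, here is a minimal PyTorch sketch; the class name, gating formulation, and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a gated cross-modal attention block (not the authors' code).
# Audio hidden states (e.g., from wav2vec 2.0) attend to another modality's
# features; a learned sigmoid gate decides how much of the attended signal to
# admit, one common way to suppress redundant cross-modal information.
import torch
import torch.nn as nn


class GatedCrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from the audio features and the attended features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, dim) queries; other: (batch, T_o, dim) keys/values.
        attended, _ = self.attn(query=audio, key=other, value=other)
        g = self.gate(torch.cat([audio, attended], dim=-1))  # values in [0, 1]
        # Gated residual fusion: keep the audio stream, admit only the gated
        # portion of the cross-modal signal.
        return self.norm(audio + g * attended)


if __name__ == "__main__":
    block = GatedCrossModalAttention(dim=768)   # wav2vec 2.0 base hidden size
    audio = torch.randn(2, 100, 768)            # audio frames
    text = torch.randn(2, 30, 768)              # projected text features
    fused = block(audio, text)
    print(fused.shape)                          # torch.Size([2, 100, 768])
```

The sigmoid gate here acts as a per-dimension filter: where the cross-modal context duplicates what the audio stream already encodes, the gate can close toward zero, which matches the redundancy-reduction goal stated in the second highlight.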