Self-supervised Cross-modal Pretraining for Speech Emotion Recognition and Sentiment Analysis

Published: 01 Jan 2022, Last Modified: 05 May 2025EMNLP (Findings) 2022EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches utilize either shallow cross-modal fusion of pretrained features, or deep cross-modal fusion with raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep fusion manner during fine-tuning stage. However those approaches have not led to improved results, partially due to their relatively simple fusion mechanisms and lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply-fused audio-text bi-modal transformer with carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate the cross-modal learning. Our experiment results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.
Loading