Abstract: Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches use either shallow cross-modal fusion of pretrained features or deep cross-modal fusion of raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep-fusion manner during the fine-tuning stage. However, these approaches have not yielded improved results, partly due to their relatively simple fusion mechanisms and the lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply fused audio-text bi-modal transformer with a carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate cross-modal learning. Our experimental results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.
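The abstract does not detail the fusion mechanism, but the general idea of deep audio-text fusion over pretrained encoders can be illustrated with a minimal PyTorch sketch: RoBERTa and HuBERT produce token-level and frame-level representations, which are then combined by a stack of bidirectional cross-attention blocks before classification. The checkpoint names, block count, mean pooling, and four-way emotion head below are illustrative assumptions, not the paper's actual architecture or pretraining scheme.

```python
# Minimal sketch of deep cross-modal fusion over pretrained RoBERTa and HuBERT.
# All hyperparameters and the classification head are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import RobertaModel, HubertModel


class CrossModalFusionBlock(nn.Module):
    """One bidirectional cross-attention block: text attends to audio and vice versa."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor):
        # Text tokens query the audio frames, and audio frames query the text tokens.
        t_out, _ = self.text_to_audio(query=text, key=audio, value=audio)
        a_out, _ = self.audio_to_text(query=audio, key=text, value=text)
        return self.norm_t(text + t_out), self.norm_a(audio + a_out)


class AudioTextEmotionModel(nn.Module):
    """Bi-modal classifier over RoBERTa and HuBERT features with stacked cross-modal fusion."""

    def __init__(self, num_classes: int = 4, num_fusion_blocks: int = 2):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.fusion = nn.ModuleList(
            [CrossModalFusionBlock() for _ in range(num_fusion_blocks)]
        )
        # Both base encoders output 768-dim hidden states; pooled modalities are concatenated.
        self.classifier = nn.Linear(768 * 2, num_classes)

    def forward(self, input_ids, attention_mask, input_values):
        text = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        audio = self.audio_encoder(input_values).last_hidden_state
        for block in self.fusion:
            text, audio = block(text, audio)
        pooled = torch.cat([text.mean(dim=1), audio.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

Bidirectional cross-attention with residual connections is one common way to realize "deep" fusion of the two streams; the paper's carefully designed mechanism and stage-wise cross-modal pretraining may differ substantially from this sketch.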