Abstract: Multimodal speech emotion recognition (SER) and sentiment analysis (SA) are important techniques for human-computer interaction. Most existing multimodal approaches use either shallow cross-modal fusion of pretrained features or deep cross-modal fusion of raw features. Recently, attempts have been made to fuse pretrained feature representations in a deep-fusion manner during the fine-tuning stage. However, these approaches have not yielded improved results, partly due to their relatively simple fusion mechanisms and the lack of proper cross-modal pretraining. In this work, leveraging single-modal pretrained models (RoBERTa and HuBERT), we propose a novel deeply fused audio-text bi-modal transformer with a carefully designed cross-modal fusion mechanism and a stage-wise cross-modal pretraining scheme to fully facilitate cross-modal learning. Our experimental results show that the proposed method achieves state-of-the-art results on the public IEMOCAP emotion and CMU-MOSEI sentiment datasets, exceeding the previous benchmarks by a large margin.
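The abstract does not detail the fusion mechanism, but the general idea of deep audio-text fusion over pretrained encoders can be illustrated with a minimal PyTorch sketch: RoBERTa and HuBERT produce token-level and frame-level representations, which are then combined by a stack of bidirectional cross-attention blocks before classification. The checkpoint names, block count, mean pooling, and four-way emotion head below are illustrative assumptions, not the paper's actual architecture or pretraining scheme.

```python
# Minimal sketch of deep cross-modal fusion over pretrained RoBERTa and HuBERT.
# All hyperparameters and the classification head are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import RobertaModel, HubertModel


class CrossModalFusionBlock(nn.Module):
    """One bidirectional cross-attention block: text attends to audio and vice versa."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor):
        # Text tokens query the audio frames, and audio frames query the text tokens.
        t_out, _ = self.text_to_audio(query=text, key=audio, value=audio)
        a_out, _ = self.audio_to_text(query=audio, key=text, value=text)
        return self.norm_t(text + t_out), self.norm_a(audio + a_out)


class AudioTextEmotionModel(nn.Module):
    """Bi-modal classifier over RoBERTa and HuBERT features with stacked cross-modal fusion."""

    def __init__(self, num_classes: int = 4, num_fusion_blocks: int = 2):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.fusion = nn.ModuleList(
            [CrossModalFusionBlock() for _ in range(num_fusion_blocks)]
        )
        # Both base encoders output 768-dim hidden states; pooled modalities are concatenated.
        self.classifier = nn.Linear(768 * 2, num_classes)

    def forward(self, input_ids, attention_mask, input_values):
        text = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        audio = self.audio_encoder(input_values).last_hidden_state
        for block in self.fusion:
            text, audio = block(text, audio)
        pooled = torch.cat([text.mean(dim=1), audio.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

Bidirectional cross-attention with residual connections is one common way to realize "deep" fusion of the two streams; the paper's carefully designed mechanism and stage-wise cross-modal pretraining may differ substantially from this sketch.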