Multi-Modal Learning for Speech Emotion Recognition: An Analysis and Comparison of ASR Outputs with Ground Truth Transcription
Abstract: In this paper, we leverage multi-modal learning and automatic speech recognition (ASR) systems toward building a speech-only emotion recognition model. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level. Text analysis has been shown to be helpful for sentiment classification. We compare the classification accuracies obtained from an audio-only model, a text-only model, and a multi-modal system leveraging both, by performing a cross-validation analysis on the IEMOCAP dataset. The confusion matrices show that it is valence-level detection that improves when textual information is incorporated. In the second stage of experiments, we use two ASR application programming interfaces (APIs) to obtain the transcriptions. We compare the performance of the multi-modal systems using the ASR transcriptions with each other and with that of a system using ground truth transcriptions. We analyze the confusion matrices to determine the effect of using ASR transcriptions instead of ground truth ones on class-wise accuracies. Finally, we investigate the generalizability of such a model by performing a cross-corpus study.
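The paper itself does not include code, but a minimal sketch of the kind of feature-level fusion pipeline the abstract describes might look as follows, assuming Python with librosa and scikit-learn. The file names, transcriptions, labels, and the choice of pooled MFCC and TF-IDF features with a linear SVM are illustrative placeholders, not the paper's actual models.

```python
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def acoustic_features(wav_path, sr=16000):
    """Mean/std-pooled MFCCs as a fixed-length utterance representation."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical corpus: (wav path, transcription, emotion label) triples.
# In the paper's setting these would come from IEMOCAP, with transcriptions
# taken either from the ground truth or from an ASR API.
corpus = [
    ("utt001.wav", "i am so happy to see you", "happy"),
    ("utt002.wav", "leave me alone right now", "angry"),
    # ... more utterances ...
]
wavs, texts, labels = zip(*corpus)

# Text branch: TF-IDF sentence representation of the transcription.
text_feats = TfidfVectorizer().fit_transform(texts).toarray()

# Audio branch: one pooled MFCC vector per utterance.
audio_feats = np.vstack([acoustic_features(w) for w in wavs])

# Feature-level fusion: concatenate the two modalities.
fused = np.hstack([audio_feats, text_feats])

# Cross-validated predictions, then a confusion matrix for class-wise
# analysis (e.g., checking whether valence confusions shrink with text).
preds = cross_val_predict(SVC(kernel="linear"), fused, np.array(labels), cv=5)
print(confusion_matrix(labels, preds))
```

Training the same classifier on `audio_feats` alone, `text_feats` alone, and `fused` reproduces the three-way comparison in the abstract, and swapping ground truth transcriptions for ASR output in `corpus` mirrors the second stage of experiments.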