Abstract: We present a system description of input-level multimodal fusion of audio, video, and text for recognizing emotions and their intensities, developed for
the 2018 First Grand Challenge on Computational
Modeling of Human Multimodal Language. Our
proposed approach is based on input-level feature
fusion with sequence learning using Bidirectional Long Short-Term Memory (BLSTM) deep neural
networks (DNNs). We show that our fusion approach outperforms unimodal predictors. Our system performs 6-way simultaneous classification
and regression, allowing for overlapping emotion
labels in a video segment. This leads to an overall binary accuracy of 90%, overall 4-class accuracy of 89.2% and an overall mean-absolute-error
(MAE) of 0.12. Our work shows that an early fusion technique can effectively predict the presence
of multi-label emotions as well as their coarse-grained intensities. The presented multimodal approach establishes a simple and robust baseline on this
new Grand Challenge dataset. Furthermore, we
provide a detailed analysis of emotion intensity
distributions as output from our DNN, as well as
a related discussion.
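To make the described setup concrete, below is a minimal sketch of how input-level (early) fusion with a BLSTM and a joint 6-way classification/regression output could look. The layer sizes, feature dimensions, and names (`EarlyFusionBLSTM`, `presence_head`, `intensity_head`) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 6  # 6-way simultaneous classification and regression

class EarlyFusionBLSTM(nn.Module):
    # Assumed feature dimensions for illustration only.
    def __init__(self, audio_dim=74, video_dim=35, text_dim=300, hidden_dim=128):
        super().__init__()
        fused_dim = audio_dim + video_dim + text_dim  # input-level concatenation
        self.blstm = nn.LSTM(fused_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Two heads over the BLSTM summary: multi-label emotion presence
        # (overlapping labels allowed) and per-emotion intensity regression.
        self.presence_head = nn.Linear(2 * hidden_dim, NUM_EMOTIONS)
        self.intensity_head = nn.Linear(2 * hidden_dim, NUM_EMOTIONS)

    def forward(self, audio, video, text):
        # audio/video/text: (batch, time, feature_dim), time-aligned per segment
        fused = torch.cat([audio, video, text], dim=-1)
        _, (h_n, _) = self.blstm(fused)
        # concatenate forward and backward final hidden states
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        presence_logits = self.presence_head(h)   # train with BCEWithLogitsLoss
        intensities = self.intensity_head(h)      # train with L1/MSE loss (reported as MAE)
        return presence_logits, intensities
```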