Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

17 May 2023 · OpenReview Archive Direct Upload
Abstract: Analyzing human multimodal language is an emerging area of research in NLP. Intrinsically, human communication is multimodal (heterogeneous), temporal, and asynchronous; it consists of the language (words), visual (expressions), and acoustic (paralinguistic) modalities, all in the form of asynchronous coordinated sequences. From a resource perspective, there is a genuine need for large-scale datasets that allow for in-depth studies of multimodal language. In this paper we introduce CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset of sentiment analysis and emotion recognition to date. Using data from CMU-MOSEI and a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), we conduct experimentation to investigate how modalities interact with each other in human multimodal language. Unlike previously proposed fusion techniques, DFG is highly interpretable and achieves competitive performance compared to the current state of the art.