Abstract: Multi-modal approaches to sentiment analysis have shown the potential to outperform uni-modal approaches. One of the challenges in this domain is to effectively model cross-view dynamics alongside view-specific dynamics. This paper proposes a model that captures both kinds of dynamics and applies attention over the contributing features from each modality to predict utterance-level sentiment. Within this model, the paper introduces a deep learning pipeline, the Cross-view Recurrent Neural Network Pair, which computes cross-view dynamics and integrates them with view-specific dynamics to obtain contextually rich utterance representations. The proposed model is evaluated on the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) and CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) datasets, achieving an accuracy of 81.78% on CMU-MOSI and 80.45% on CMU-MOSEI.
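The abstract does not include implementation details, but the described architecture can be illustrated with a minimal sketch. The PyTorch code below is one plausible reading, not the authors' implementation: per-modality GRUs model view-specific dynamics, a pair of GRUs over concatenated modality pairs stands in for the Cross-view Recurrent Neural Network Pair, and a learned attention weights the contributing features before a final sentiment head. All module names, the text/audio/video pairing choice, hidden sizes, and feature dimensions are assumptions.

```python
# Hypothetical sketch of the described architecture; not the paper's code.
import torch
import torch.nn as nn


class CrossViewRNNPairSketch(nn.Module):
    def __init__(self, dim_text, dim_audio, dim_video, hidden=64):
        super().__init__()
        # View-specific dynamics: one GRU per modality.
        self.text_rnn = nn.GRU(dim_text, hidden, batch_first=True)
        self.audio_rnn = nn.GRU(dim_audio, hidden, batch_first=True)
        self.video_rnn = nn.GRU(dim_video, hidden, batch_first=True)
        # Cross-view dynamics: a pair of GRUs over concatenated modality
        # pairs (text+audio, text+video); this pairing is an assumption.
        self.cross_ta = nn.GRU(dim_text + dim_audio, hidden, batch_first=True)
        self.cross_tv = nn.GRU(dim_text + dim_video, hidden, batch_first=True)
        # Attention over the five contributing feature vectors.
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, 1)  # utterance-level sentiment

    def forward(self, text, audio, video):
        # Each input: (batch, seq_len, dim_modality); use final hidden states.
        feats = []
        for rnn, x in [
            (self.text_rnn, text),
            (self.audio_rnn, audio),
            (self.video_rnn, video),
            (self.cross_ta, torch.cat([text, audio], dim=-1)),
            (self.cross_tv, torch.cat([text, video], dim=-1)),
        ]:
            _, h = rnn(x)
            feats.append(h[-1])                    # (batch, hidden)
        feats = torch.stack(feats, dim=1)          # (batch, 5, hidden)
        weights = torch.softmax(self.attn(feats), dim=1)
        utterance = (weights * feats).sum(dim=1)   # attended representation
        return self.classifier(utterance)


# Usage on random features; dims echo common CMU-MOSI extractors
# (GloVe 300-d text, COVAREP 74-d audio, Facet 35-d video) but are
# illustrative only.
model = CrossViewRNNPairSketch(dim_text=300, dim_audio=74, dim_video=35)
t, a, v = torch.randn(8, 20, 300), torch.randn(8, 20, 74), torch.randn(8, 20, 35)
print(model(t, a, v).shape)  # torch.Size([8, 1])
```

The attention step mirrors the abstract's claim that the model weights "the contributing features from each modality"; how the paper actually fuses cross-view and view-specific representations may differ.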
DOI: 10.1007/978-981-19-8477-8_20