- Keywords: multimodal learning, multimodal temporal data, multimodal fusion
- TL;DR: a recurrent multistage fusion model to learn from spatiotemporal data with applications to multimodal language analysis
- Abstract: Computational modeling of human multimodal language is an emerging research area of spatiotemporal modeling spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the spatial interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions) in complex temporal data. We propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the spatiotemporal fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion. Spatial cross-modal interactions are modeled using this multistage fusion approach, which builds upon the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. RMFN achieves state-of-the-art performance in modeling multimodal language across three tasks: multimodal sentiment analysis, emotion recognition, and speaker traits recognition. Experiments show that each stage of fusion focuses on a different subset of multimodal signals and learns increasingly discriminative representations.
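The multistage fusion idea described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual architecture: all dimensions, parameter names (`W_attn`, `W_fuse`), the number of stages, and the random initialization are assumptions for the sketch, and the recurrent temporal component (the system of RNNs that carries intra-modal dynamics across timesteps) is omitted. Each stage softly highlights a subset of the concatenated multimodal signals, conditioned on the intermediate fused representation built by earlier stages, then updates that fused representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: language / visual / acoustic features and fusion state.
d_l, d_v, d_a, d_f = 8, 4, 4, 6
d_cat = d_l + d_v + d_a

# Hypothetical per-stage parameters, randomly initialized for the sketch.
n_stages = 3
W_attn = rng.standard_normal((n_stages, d_cat, d_cat + d_f)) * 0.1
W_fuse = rng.standard_normal((n_stages, d_f, d_cat + d_f)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multistage_fusion(x_l, x_v, x_a):
    """Fuse one timestep of language/visual/acoustic features in stages.

    Each stage attends to a soft subset of the concatenated signals,
    conditioned on the fused representation from previous stages, so
    later stages can specialize on signals earlier stages left out.
    """
    z = np.concatenate([x_l, x_v, x_a])   # intra-modal features, concatenated
    fused = np.zeros(d_f)                 # intermediate fused representation
    for k in range(n_stages):
        ctx = np.concatenate([z, fused])
        attn = sigmoid(W_attn[k] @ ctx)   # highlight: soft subset of signals
        highlighted = attn * z            # stage k's selected multimodal signals
        fused = np.tanh(W_fuse[k] @ np.concatenate([highlighted, fused]))
    return fused

fused = multistage_fusion(rng.standard_normal(d_l),
                          rng.standard_normal(d_v),
                          rng.standard_normal(d_a))
print(fused.shape)  # (6,)
```

In the full model, this fused vector would feed back into per-modality recurrent networks at each timestep, so that cross-modal fusion and intra-modal temporal modeling interact throughout the sequence.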