Abstract: The effective fusion of multiple modalities (i.e., text, acoustic, and visual) is a non-trivial task, as these modalities often carry specific and diverse information and do not contribute equally. Fusing different modalities can be even more challenging in the low-resource setting, where fewer samples are available for training. This paper proposes a multi-representative fusion mechanism that generates diverse fusions of multiple modalities and then chooses the best fusion among them. To achieve this, we first apply convolution filters to the multimodal inputs to generate diverse representations of each modality. We then fuse pairwise modalities across these multiple representations to obtain multiple candidate fusions. Finally, we propose an attention mechanism that selects only the most appropriate fusion, which helps mitigate noise by ignoring the noisy fusions. We evaluate our proposed approach on three low-resource multimodal sentiment analysis datasets, i.e., YouTube, MOUD, and ICT-MMMO. Experimental results show the effectiveness of our approach, with accuracies of 59.3%, 83.0%, and 84.1% on the YouTube, MOUD, and ICT-MMMO datasets, respectively.
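The following is a minimal sketch of the pipeline described above, for illustration only: convolution filters produce several representations per modality, representations are fused pairwise, and an attention layer weights the candidate fusions so that noisy ones are down-weighted. The module name, filter sizes, pooling, and the use of soft attention (rather than a hard selection) are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiRepresentativeFusion(nn.Module):
    """Illustrative sketch (not the paper's implementation):
    conv filters -> multiple representations per modality ->
    pairwise fusions -> attention over fusions."""

    def __init__(self, in_dim_a, in_dim_b, hidden_dim, num_reps=3):
        super().__init__()
        # One bank of 1-D convolutions per modality; each filter yields one
        # representation of the input sequence (kernel sizes are hypothetical).
        kernels = (1, 3, 5)[:num_reps]
        self.convs_a = nn.ModuleList(
            [nn.Conv1d(in_dim_a, hidden_dim, kernel_size=k, padding=k // 2) for k in kernels])
        self.convs_b = nn.ModuleList(
            [nn.Conv1d(in_dim_b, hidden_dim, kernel_size=k, padding=k // 2) for k in kernels])
        # Simple pairwise fusion of two modality representations.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # Attention scores over the candidate fusions.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, x_a, x_b):
        # x_a: (batch, seq, in_dim_a); x_b: (batch, seq, in_dim_b)
        reps_a = [conv(x_a.transpose(1, 2)).mean(dim=2) for conv in self.convs_a]
        reps_b = [conv(x_b.transpose(1, 2)).mean(dim=2) for conv in self.convs_b]
        # Fuse every pair of representations (one from each modality).
        fusions = [torch.tanh(self.fuse(torch.cat([ra, rb], dim=-1)))
                   for ra in reps_a for rb in reps_b]
        fusions = torch.stack(fusions, dim=1)              # (batch, pairs, hidden)
        # Soft selection: noisy fusions receive low attention weight.
        weights = F.softmax(self.attn(fusions).squeeze(-1), dim=1)
        return torch.einsum("bp,bph->bh", weights, fusions)

# Example usage with toy text/acoustic features (shapes are illustrative).
model = MultiRepresentativeFusion(in_dim_a=300, in_dim_b=74, hidden_dim=128)
out = model(torch.randn(8, 20, 300), torch.randn(8, 20, 74))   # (8, 128)
```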