Abstract: Multimodal sentiment analysis (MSA) is gaining traction as a critical tool for understanding human behavior and enabling a wide range of applications. Since data of different modalities may lie in completely distinct spaces, it is very challenging to perform effective fusion and analysis over asynchronous multimodal streams. Most previous work has focused on aligned fusion, which is impractical in real-world scenarios. The recent Multimodal Transformer (MulT) approach attempts to model the correlations between elements from different modalities in an unaligned manner. However, it collects temporal information with a self-attention Transformer, a sequence model, so interactions across distinct time steps remain insufficient. In this paper, we propose the Circulant-interactive Transformer Network with dimension-aware fusion (CITN-DAF), which enables parallel computation across modalities and time steps and alleviates inter-modal temporal sensitivity while preserving intra-modal semantic order. By incorporating circulant matrices into the cross-modal attention mechanism, CITN-DAF aims to examine all conceivable interactions between vectors of different modalities. In addition, a dimension-aware fusion method is presented, which projects feature representations into different subspaces for in-depth fusion. We evaluate CITN-DAF on three widely used sentiment analysis benchmarks: CMU-MOSEI, CMU-MOSI, and IEMOCAP. Extensive experimental results reveal that CITN-DAF is superior in cross-modal semantic interactions and outperforms state-of-the-art multimodal methods.
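The abstract does not spell out the exact formulation of the circulant cross-modal attention, so the following is only a minimal illustrative sketch of one plausible reading: the key/value stream of one modality is expanded into all of its cyclic time-shifts (the rows of a circulant matrix), so every query time step of the other modality can attend to every key time step under every relative rotation in parallel. All function and parameter names here (`circulant`, `circulant_cross_modal_attention`, `w_q`, `w_k`, `w_v`) are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of circulant cross-modal attention; not the paper's code.
import torch
import torch.nn.functional as F


def circulant(seq: torch.Tensor) -> torch.Tensor:
    """Return all cyclic time-shifts of seq with shape (T, d) as a (T, T, d) tensor.

    Row r holds seq rolled by r steps, i.e. the rows of a circulant matrix
    built along the time axis.
    """
    T = seq.size(0)
    shift = (torch.arange(T).unsqueeze(1) + torch.arange(T).unsqueeze(0)) % T
    return seq[shift]


def circulant_cross_modal_attention(q_seq, kv_seq, w_q, w_k, w_v):
    """Cross-modal attention from one modality (queries) onto the circulant
    expansion of another (keys/values), so all time-step interactions under
    all cyclic rotations are scored in a single parallel matrix product.

    q_seq:  (T_q, d) query-modality sequence (e.g., text)
    kv_seq: (T_k, d) key/value-modality sequence (e.g., audio), unaligned
    """
    T_k, d = kv_seq.shape
    shifted = circulant(kv_seq).reshape(T_k * T_k, d)  # all rotations, flattened
    Q = q_seq @ w_q                                    # (T_q, d_k)
    K = shifted @ w_k                                  # (T_k * T_k, d_k)
    V = shifted @ w_v                                  # (T_k * T_k, d_v)
    attn = F.softmax(Q @ K.T / (Q.size(-1) ** 0.5), dim=-1)
    return attn @ V                                    # (T_q, d_v)


# Usage: unaligned sequence lengths are fine, since attention is computed
# over all key rotations rather than over a fixed alignment.
text = torch.randn(20, 64)    # T_text x d
audio = torch.randn(30, 64)   # T_audio x d
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out = circulant_cross_modal_attention(text, audio, w_q, w_k, w_v)  # (20, 32)
```

Under this reading, the circulant expansion is what lets the model trade strict temporal alignment for exhaustive rotation-wise interaction while the within-modality ordering of each rotated copy is preserved.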