Abstract: Multi-modal emotion recognition aims to recognize emotions precisely by exploiting information from different modalities. Current multi-modal methods often suffer from the modality gap. Moreover, many works use attention mechanisms to fuse features from different modalities, which reduces modality diversity. To address these issues, we propose an inter-modality and intra-sample alignment framework for multi-modal emotion recognition. Specifically, we first map text and audio features into a shared subspace and align them using a cross-modal self-attention mechanism and a Maximum Mean Discrepancy (MMD) matrix. To preserve the diversity of each modality, we extract modality-specific features with separate encoders. Finally, we employ supervised contrastive learning to align the features at the sample level. Extensive experiments show that our method achieves state-of-the-art performance.
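As a minimal sketch (not the authors' implementation), the following illustrates how an MMD loss can measure the discrepancy between text and audio features after projection into a shared subspace; the Gaussian kernel bandwidth, batch size, and feature dimension are assumptions chosen for illustration.

```python
import torch


def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise Gaussian (RBF) kernel: k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dist = torch.cdist(x, y, p=2) ** 2      # squared Euclidean distances between all pairs
    return torch.exp(-sq_dist / (2 * sigma ** 2))


def mmd_loss(text_feats: torch.Tensor, audio_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between two batches of projected features."""
    k_tt = gaussian_kernel(text_feats, text_feats, sigma).mean()
    k_aa = gaussian_kernel(audio_feats, audio_feats, sigma).mean()
    k_ta = gaussian_kernel(text_feats, audio_feats, sigma).mean()
    return k_tt + k_aa - 2 * k_ta


# Example: a batch of 32 samples projected into an assumed 128-dim shared subspace.
text_feats = torch.randn(32, 128)
audio_feats = torch.randn(32, 128)
print(mmd_loss(text_feats, audio_feats))
```

Minimizing such a loss alongside the recognition objective pulls the two modality distributions together in the shared subspace, which is the alignment effect described above.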