Abstract: Multimodal emotion recognition is gaining significant attention for its ability to fuse complementary information from diverse physiological and behavioral signals, which benefits the understanding of emotional disorders. However, multimodal fusion is challenging because of uncertainties inherent in the different modalities, such as complex signal coupling and modality heterogeneity. Furthermore, feature distribution drift across subjects hinders generalization and significantly degrades performance on new individuals. To address these issues, we propose a robust cross-subject multimodal emotion recognition framework that extracts subject-independent, intrinsic emotional information from heterogeneous multimodal emotion data. First, we develop a multichannel network that uses self-attention to capture modality-specific features and cross-attention to capture complementary features among modalities. Second, we incorporate a contrastive loss into the multichannel attention network to enhance feature extraction across channels, thereby facilitating the disentanglement of emotion-specific information. Moreover, we devise a self-expression learning-based network layer to enhance feature discriminability and subject alignment: it aligns samples in a discriminative space using block diagonal matrices and maps multiple individuals to a shared subspace using a block off-diagonal matrix. Finally, attention is used to merge the multichannel features, and a multilayer perceptron is employed for classification. Experimental results on multimodal emotion datasets confirm that the proposed approach surpasses current state-of-the-art methods in emotion recognition accuracy, with particularly significant gains in the challenging cross-subject multimodal recognition scenario.
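As a rough illustration of the difference between the self-attention and cross-attention channels mentioned in the abstract, the sketch below applies plain scaled dot-product attention to two toy feature sequences. It is not the authors' network: the learned query/key/value projections are omitted, and the names `attention`, `modality_a`, and `modality_b` and all feature values are hypothetical placeholders chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Assumption: no learned projections; the inputs are used directly
    as queries, keys, and values to keep the sketch minimal.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy features: 3 time steps for one modality, 2 for another.
modality_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modality_b = [[0.5, 0.5], [1.0, -1.0]]

# Self-attention: a modality attends to itself (modality-specific features).
a_self = attention(modality_a, modality_a, modality_a)

# Cross-attention: queries from one modality, keys and values from the
# other (complementary features).
a_cross = attention(modality_a, modality_b, modality_b)
```

In both calls the output has one row per query, so `a_self` and `a_cross` each contain three 2-dimensional vectors; the cross-attention rows are weighted averages of the rows of `modality_b`.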
External IDs:doi:10.1109/tcds.2025.3552124