Abstract: Multi-modal emotion recognition has attracted considerable attention in human-computer interaction because complementary modalities provide richer information to the recognition model. However, distribution drift across subjects and the heterogeneity of different modalities pose challenges that limit its practical application, and most existing multi-modal emotion recognition methods struggle to suppress these sources of uncertainty during fusion. In this paper, we propose a cross-subject multi-modal emotion recognition framework that jointly learns subject-independent representations and features common to EEG and eye movements. First, we design a dynamic adversarial domain adaptation scheme for cross-subject distribution alignment, dynamically selecting source domains during training. Second, we simultaneously capture intra-modal and inter-modal emotion-related features with self-attention and cross-attention mechanisms, yielding a robust and complementary representation of emotional information. Two contrastive loss functions are then imposed on this network to further reduce inter-modal heterogeneity and to mine higher-order semantic similarity between synchronously collected multi-modal data. Finally, the output of the softmax layer is taken as the predicted label. Experimental results on several multi-modal emotion datasets with EEG and eye movements demonstrate that our method significantly outperforms state-of-the-art emotion recognition approaches.
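The abstract describes self-attention for intra-modal features, cross-attention for inter-modal features, and a contrastive objective over synchronized EEG/eye-movement pairs. As an illustration only, since the paper's actual architecture, dimensions, and loss formulation are not given here, a minimal PyTorch sketch of this kind of fusion (with hypothetical layer sizes and an InfoNCE-style contrastive term) might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of self-attention + cross-attention fusion between EEG and
    eye-movement token sequences. Dimensions and structure are assumptions,
    not the paper's actual design."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_eye = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: each modality queries the other.
        self.cross_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_eye = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, eeg, eye):
        # eeg, eye: (batch, seq_len, dim) feature sequences per modality.
        eeg_intra, _ = self.self_eeg(eeg, eeg, eeg)  # intra-modal features
        eye_intra, _ = self.self_eye(eye, eye, eye)
        # Inter-modal features: EEG queries attend to eye features and vice versa.
        eeg_cross, _ = self.cross_eeg(eeg_intra, eye_intra, eye_intra)
        eye_cross, _ = self.cross_eye(eye_intra, eeg_intra, eeg_intra)
        # Pool over the sequence to obtain one embedding per modality.
        return eeg_cross.mean(dim=1), eye_cross.mean(dim=1)

def contrastive_loss(z_eeg, z_eye, temperature=0.1):
    """InfoNCE-style loss that pulls synchronously collected EEG/eye pairs
    together in embedding space; one plausible instantiation of the
    contrastive objectives mentioned in the abstract."""
    z_eeg = F.normalize(z_eeg, dim=1)
    z_eye = F.normalize(z_eye, dim=1)
    logits = z_eeg @ z_eye.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(z_eeg.size(0))      # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)
```

The fused embeddings would then feed a softmax classifier, with the contrastive term added to the classification loss; the relative weighting of the two objectives is not specified in the abstract.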
DOI: 10.1109/TAFFC.2025.3554399