Adversarial Domain Generalized Transformer for Cross-Corpus Speech Emotion Recognition

Published: 2024 · Last Modified: 22 Jan 2026 · IEEE Trans. Affect. Comput. 2024 · CC BY-SA 4.0
Abstract: Speech emotion recognition (SER) promotes the development of intelligent devices that enable natural, friendly human-computer interaction. However, the recognition performance of existing approaches drops significantly on unseen datasets, and the lack of sufficient training data limits the generalizability of deep learning models. In this article, we analyze the impact of domain generalization on cross-corpus SER and propose an adversarial domain generalized transformer (ADoGT) that learns a feature distribution shared by the source and target domains. Specifically, we investigate the effect of domain adversarial learning in eliminating non-affective information. We also combine the center loss with the softmax function as joint supervision to learn discriminative features. Moreover, we introduce unsupervised transfer learning to extract additional features and incorporate a gated fusion model to learn the complementary information between the features produced by the supervised feature extractor and the pretrained model. The proposed transformer-based domain generalization method is evaluated on four emotional datasets. We also provide an ablation study of different domain adversarial model structures and feature fusion models. The results of comparative experiments demonstrate the effectiveness of the proposed ADoGT.
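The abstract names two concrete components whose general form is standard: joint supervision combining softmax cross-entropy with a center loss, and a gated fusion of two feature streams. The sketch below is a minimal PyTorch illustration of those generic techniques, not the authors' implementation; the class names, feature dimension, and loss weight `lam` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Keeps one learnable center per emotion class and penalizes the
    squared distance between each feature and its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # Mean squared distance to the center of each sample's class.
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

class GatedFusion(nn.Module):
    """Mixes two feature streams elementwise with a learned sigmoid gate,
    so the model can weight supervised vs. pretrained features per dimension."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, a, b):
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b

# Joint supervision: cross-entropy (softmax) plus a weighted center loss.
def joint_loss(logits, features, labels, center_loss, lam=0.1):
    return F.cross_entropy(logits, labels) + lam * center_loss(features, labels)
```

In this generic setup, the cross-entropy term keeps classes separable while the center term pulls same-class features together, and the gate lets fusion adapt per feature dimension rather than using a fixed average.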