Cross-Lingual Speech Emotion Recognition with Self-Supervised Models: A Confound-Controlled Comparison

TMLR Paper9566 Authors

07 Jun 2026 (modified: 20 Jun 2026) | Under review for TMLR | CC BY 4.0
Abstract: Speech emotion recognition across languages remains difficult because emotional cues interact with speaker identity, language, and recording conditions. emotion2vec, an emotion-specialized self-supervised learning (SSL) speech model, reports large gains over general-purpose SSL encoders such as HuBERT and WavLM across multiple non-English languages. This paper re-examines that claim under a confound-controlled cross-lingual evaluation. We compare HuBERT, WavLM, and emotion2vec on five emotional speech corpora spanning German, English, Mandarin, and Bangla, with an additional external test on Thai. Across speaker-independent probing, matched per-dataset evaluation, non-EmoBox generalization, and zero-shot/few-shot cross-lingual transfer, the general-purpose encoders consistently match or outperform emotion2vec. In speaker-independent four-class evaluation, emotion2vec ranks last on all five corpora, with significant gaps on four. Under cross-lingual transfer across 12 source-target language pairs, emotion2vec trails the general-purpose SSL models by 12 to 18 percentage points in zero-shot transfer and remains about 10 percentage points behind in the few-shot setting, which uses 100 target-language examples per source-target pair. On Thai, a language outside the EmoBox fine-tuning distribution, the fine-tuned emotion2vec variant performs worse than both general-purpose SSL models and its own non-fine-tuned base version. We further show that the gap is not explained by speaker identity: after iterative null-space projection removes speaker-discriminative directions, HuBERT and WavLM remain ahead. These results suggest that emotion2vec does not transfer across languages as reliably as general-purpose SSL encoders.
In practice, HuBERT and WavLM remain strong defaults, reaching 0.78 to 0.93 in-language accuracy on the four-class task and about 0.77 cross-lingual accuracy in the few-shot setting. Code is available at https://anonymous.4open.science/r/ser-cross-ling-pub-536B
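
The speaker-identity control mentioned in the abstract, iterative null-space projection, can be illustrated with a minimal sketch. It is not the released implementation: the function names, the least-squares probe (swapped in here for whatever speaker classifier the paper uses), and the iteration count are all assumptions made for illustration. Each round fits a linear speaker probe on the current features and then projects the features onto the null space of the probe weights, so the directions that probe relied on are removed.

```python
import numpy as np


def nullspace_projection(W, tol=1e-10):
    """Orthogonal projector onto the null space of W (rows are probe directions)."""
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    basis = vt[s > tol * s.max()]  # orthonormal basis of the row space of W
    return np.eye(W.shape[1]) - basis.T @ basis


def fit_linear_probe(X, y, n_classes):
    """Least-squares probe on one-hot speaker labels (an assumed stand-in classifier)."""
    Xc = X - X.mean(axis=0)
    Y = np.eye(n_classes)[y]
    Yc = Y - Y.mean(axis=0)
    W, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)  # minimum-norm solution, shape (d, k)
    return W.T  # shape (k, d)


def inlp(X, speaker_ids, n_iters=5):
    """Return a projector P; X @ P has the probed speaker directions removed."""
    n_classes = int(speaker_ids.max()) + 1
    P = np.eye(X.shape[1])
    for _ in range(n_iters):
        # Refit the probe on the already-projected features, then remove
        # the directions it found on top of those removed so far.
        W = fit_linear_probe(X @ P, speaker_ids, n_classes)
        P = nullspace_projection(W) @ P
    return P
```

Under these assumptions, an emotion probe would then be trained on `X @ P` in place of `X`, so that any remaining gap between encoders cannot be attributed to the linearly decodable speaker directions that were removed.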
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xiatian_Zhu3
Submission Number: 9566