Cascaded Adversarial Learning for Speaker Independent Emotion Recognition

Chamara Kasun Liyanaarachchi Lekamalage, Zhiping Lin, Guang-Bin Huang, Jagath Chandana Rajapakse

Published: 2022, Last Modified: 06 Nov 2025, IJCNN 2022, CC BY-SA 4.0

Abstract: In contrast to traditional adversarial learning (AL), which learns speaker-invariant representations, this paper proposes cascaded adversarial learning (CAL), which learns speaker-invariant emotion data for speaker independent emotion recognition (SIER) tasks. CAL is a dual cascaded network architecture in which the output of the transformation network is fed as input to the classification network. The transformation network transforms the original speech emotion data into speaker-invariant emotion data by implementing an AL strategy with an encoder-decoder architecture. The classification network predicts the emotion from the speaker-invariant emotion data (the output of the transformation network). We argue that the speaker-invariant emotion data produced by the transformation network has less variation than the original speech emotion data and is therefore conducive to SIER, as it improves generalization capability. To our knowledge, this is the first time a dual cascaded network has been used for SIER, and we demonstrate state-of-the-art performance for SIER on the Emo-DB and RAVDESS datasets.
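The cascade described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the single-layer encoder and decoder, the random untrained weights, and the placement of a speaker critic on the decoder output are all assumptions made for illustration, and every name below (`transform`, `classify`, `speaker_posteriors`, `cascade`) is hypothetical. The sketch only shows the data flow, in which the classifier sees the transformed data rather than the original features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
N_FEATURES, N_HIDDEN, N_SPEAKERS, N_EMOTIONS = 40, 16, 10, 7


def dense(in_dim, out_dim):
    """Random, untrained weights and zero bias for one fully connected layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def apply_layer(x, layer, activation=np.tanh):
    """Affine map followed by an activation."""
    w, b = layer
    return activation(x @ w + b)


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def identity(z):
    return z


# Transformation network: encoder-decoder (one layer each, an assumption).
encoder = dense(N_FEATURES, N_HIDDEN)
decoder = dense(N_HIDDEN, N_FEATURES)
# Adversary: assumed here to predict the speaker from the decoder output.
speaker_critic = dense(N_FEATURES, N_SPEAKERS)
# Classification network: predicts emotion from the transformed data.
emotion_classifier = dense(N_FEATURES, N_EMOTIONS)


def transform(x):
    """Map original features to (ideally) speaker-invariant emotion data."""
    return apply_layer(apply_layer(x, encoder), decoder)


def speaker_posteriors(x_inv):
    """Speaker probabilities that adversarial training would try to make uninformative."""
    return softmax(apply_layer(x_inv, speaker_critic, activation=identity))


def classify(x_inv):
    """Emotion probabilities computed from the transformed data."""
    return softmax(apply_layer(x_inv, emotion_classifier, activation=identity))


def cascade(x):
    """Cascade: the transformation network's output is the classifier's input."""
    x_inv = transform(x)
    return x_inv, classify(x_inv)


batch = rng.normal(size=(4, N_FEATURES))  # stand-in for acoustic feature vectors
x_inv, emotion_probs = cascade(batch)
speaker_probs = speaker_posteriors(x_inv)
```

In an actual training setup, the transformation network would be optimized so that the speaker critic cannot identify the speaker from `x_inv`, while the classification network is trained on `x_inv` to predict emotion; the abstract does not specify the loss functions or training schedule, so they are left out of this sketch.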