Abstract: In contrast to traditional adversarial learning (AL), which learns speaker-invariant representations, this paper proposes cascaded adversarial learning (CAL), which learns speaker-invariant emotion data for speaker-independent emotion recognition (SIER) tasks. CAL is a dual cascaded network architecture in which the output of the transformation network is fed as input to the classification network. The transformation network transforms the original speech emotion data into speaker-invariant emotion data by implementing an AL strategy with an encoder-decoder architecture. The classification network predicts the emotion from the speaker-invariant emotion data (the output of the transformation network). We argue that the speaker-invariant emotion data produced by the transformation network has less variation than the original speech emotion data and is therefore conducive to SIER, as it improves generalization capability. To our knowledge, this is the first time a dual cascaded network has been used for SIER, and we demonstrate state-of-the-art performance for SIER on the Emo-DB and RAVDESS datasets.
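The cascade described in the abstract can be illustrated with a minimal forward-pass sketch. It only shows the data flow (transformation network, then classification network) and is not the paper's implementation. Several things are assumptions: the single-layer linear encoder, decoder, and classifiers, the random untrained weights, all dimension constants, every class and function name, and the choice to attach the speaker discriminator to the output of the transformation network.

```python
import math
import random


def init_weights(in_dim, out_dim, rng):
    """Random weight matrix of shape (out_dim x in_dim); untrained, for illustration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(weights, x, activation=None):
    """Multiply a weight matrix by a vector, optionally applying an activation."""
    y = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [activation(v) for v in y] if activation else y


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TransformationNetwork:
    """Hypothetical encoder-decoder: maps input features to same-sized output features."""

    def __init__(self, feat_dim, latent_dim, rng):
        self.enc = init_weights(feat_dim, latent_dim, rng)
        self.dec = init_weights(latent_dim, feat_dim, rng)

    def __call__(self, x):
        latent = matvec(self.enc, x, math.tanh)  # encoder
        return matvec(self.dec, latent)          # decoder


class LinearClassifier:
    """Hypothetical one-layer softmax classifier (used for emotion and for speaker)."""

    def __init__(self, feat_dim, n_classes, rng):
        self.w = init_weights(feat_dim, n_classes, rng)

    def __call__(self, x):
        return softmax(matvec(self.w, x))


def cascade_forward(x, transform, emotion_clf, speaker_disc):
    """Stage 1 transforms the input; stage 2 classifies emotion from the stage 1 output."""
    x_inv = transform(x)                 # intended to be speaker-invariant after training
    emotion_probs = emotion_clf(x_inv)   # classification network sees only x_inv
    speaker_probs = speaker_disc(x_inv)  # assumed adversary, relevant only during training
    return x_inv, emotion_probs, speaker_probs


# Illustrative sizes only; they are not taken from the paper.
FEAT_DIM, LATENT_DIM, N_EMOTIONS, N_SPEAKERS = 8, 4, 7, 10

rng = random.Random(0)
transform = TransformationNetwork(FEAT_DIM, LATENT_DIM, rng)
emotion_clf = LinearClassifier(FEAT_DIM, N_EMOTIONS, rng)
speaker_disc = LinearClassifier(FEAT_DIM, N_SPEAKERS, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]  # stand-in for one feature vector
x_inv, emotion_probs, speaker_probs = cascade_forward(x, transform, emotion_clf, speaker_disc)
predicted_emotion = emotion_probs.index(max(emotion_probs))
```

In a typical adversarial setup, the discriminator would be trained to identify the speaker while the transformation network is trained to make that identification fail. The abstract does not give the exact losses or training schedule used in CAL, so they are left out of this sketch.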