Abstract: Verifying a person's identity based on their voice is a challenging, real-world problem in biometric security. A crucial requirement of such speaker verification systems is to be domain robust. Performance should not degrade even if speakers are talking in languages not seen during training. To this end, we present a flexible and interpretable framework for learning domain invariant speaker embeddings using Generative Adversarial Networks. We combine adversarial training with an angular margin loss function, which encourages the speaker embedding model to be discriminative by directly optimizing for cosine similarity between classes. We are able to beat a strong baseline system using a cosine distance classifier and a simple score-averaging strategy. Our results also show that models with adversarial adaptation perform significantly better than unadapted models. In an attempt to better understand this behavior, we quantitatively measure the degree of invariance induced by our proposed methods using Maximum Mean Discrepancy and Frechet distances. Our analysis shows that our proposed adversarial speaker embedding models significantly reduce the distance between source and target data distributions, while performing similarly on the former and better on the latter.
TL;DR: Speaker verificaiton performance can be significantly improved by adapting the model to in-domain data using Generative Adversarial Networks. Furthermore, the adaptation can be performed in an unsupervised way.
Keywords: Speaker Verification, Generative Adversarial Networks, Domain Adaptation