Abstract: Speaker verification is known to be a difficult task, especially for short utterances. Based on the observation that real voice commands are composed of a few repeated words, we propose an effective approach for building and training a deep neural network that extracts features with properties suited to this condition. We demonstrate its effectiveness through experiments designed independently for each property. The proposed approach achieves a 5.89% equal error rate on word-scale commands shorter than 1 second; with linear discriminant analysis, the equal error rate decreases to 3.43%.
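
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER could be estimated from verification scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the score lists, and the convention that a trial is accepted when its score is at or above the threshold are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from two lists of verification scores.

    Assumption: higher scores mean "same speaker", and a trial is
    accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best = None  # (|FAR - FRR|, eer estimate, threshold)
    for t in thresholds:
        # FAR: fraction of impostor trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # FRR: fraction of genuine trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finitely many trials FAR and FRR may never be exactly
            # equal, so report their average at the closest point.
            best = (gap, (far + frr) / 2.0, t)

    return best[1], best[2]


# Hypothetical toy scores, not data from the paper.
genuine = [0.9, 0.8, 0.7, 0.3]
impostor = [0.1, 0.2, 0.4, 0.6]
eer, threshold = equal_error_rate(genuine, impostor)
```

On real trial lists the sweep would run over many thousands of scores, and some toolkits interpolate between neighbouring thresholds rather than taking the closest point as done here.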