Abstract: In this work, we assess the impact of vocal effort on the discrimination and calibration performance of a state-of-the-art speaker recognition system. We analyze three levels of vocal effort (low, normal, and high) from the SRI-FRTIV corpus. We use a deep neural network (DNN) speaker-embedding system with probabilistic linear discriminant analysis (PLDA) and find that vocal effort variation significantly degrades system performance. We apply both mixture PLDA (mix-PLDA) and trial-based calibration with condition PLDA similarity (TBC-CPLDA) to improve system robustness. The proposed approaches yield relative improvements of 18% in discrimination and 33% in calibration performance on the SRI-FRTIV corpus.