Speaker-discriminative Embedding Learning via Affinity Matrix for Short Utterance Speaker Verification

Published: 01 Jan 2019 · Last Modified: 16 May 2025 · APSIPA 2019 · CC BY-SA 4.0
Abstract: The text-independent short utterance speaker verification (TI-SUSV) task remains more challenging than full-length utterance SV due to inaccurately estimated feature statistics and insufficiently discriminative speaker embeddings. Recently developed end-to-end SV (E2E-SV) systems, which directly learn a mapping from speech features to compact fixed-length speaker embeddings, achieve state-of-the-art results on several datasets. In this study, following the E2E-SV pipeline, we strive to further improve the accuracy of the TI-SUSV task. Our work is based on two intuitive ideas: a better speech feature representation for short utterances and a better training loss function that yields more discriminative embeddings. Specifically, we first design a bidirectional gated recurrent unit network with residual connections (Res-BGRU) to improve feature representation capability. Second, we propose a novel affinity loss in which the mini-batch data is manipulated to obtain more supervision information. In detail, a speaker identity affinity matrix formed from one-hot speaker identity vectors supervises the speaker embedding affinity matrix, encouraging better inter-speaker separability and intra-speaker compactness. Experimental results on the VoxCeleb1 dataset show that our system outperforms conventional i-vector and x-vector systems on TI-SUSV.
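The affinity loss described above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction from the abstract alone: it assumes cosine similarity between L2-normalised embeddings and a mean-squared penalty against the label-derived target matrix; the paper's exact similarity measure and penalty may differ.

```python
import numpy as np

def affinity_loss(embeddings, labels_onehot):
    """Hypothetical sketch of the affinity loss from the abstract.

    embeddings:    (batch, dim) speaker embeddings for a mini-batch.
    labels_onehot: (batch, num_speakers) one-hot speaker identity vectors.
    """
    # Target affinity matrix: entry (i, j) is 1 when utterances i and j
    # share a speaker, 0 otherwise (outer product of one-hot labels).
    target = labels_onehot @ labels_onehot.T
    # Embedding affinity matrix: cosine similarity between all pairs
    # of L2-normalised embeddings (assumption; not stated in abstract).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Mean-squared difference pulls same-speaker pairs toward similarity 1
    # (intra-speaker compactness) and different-speaker pairs toward 0
    # (inter-speaker separability).
    return np.mean((sim - target) ** 2)
```

For example, a batch in which same-speaker embeddings coincide and different-speaker embeddings are orthogonal attains zero loss, matching the stated goal of compact, well-separated speaker clusters.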