Abstract: Self-supervised contrastive learning has recently emerged as one of the most promising approaches to the speaker verification task, owing to its independence from labeled data. Among these methods, the DINO-based self-supervised framework, which is trained without exploiting negative pairs, is particularly popular and achieves excellent performance on speaker verification. However, because positive segments are cropped from a single utterance of limited duration, they often overlap with each other, which may mislead the model into attending to information irrelevant to the speaker. To tackle this problem, we propose a cluster-aware (CA) training strategy that crops positive segments from several utterances within the same cluster rather than from a single utterance. In the clustering stage, we also investigate both a fixed-number clustering strategy and a progressive clustering strategy. With these strategies, our CA-DINO achieves state-of-the-art results on the Vox-O test set. Finally, we explore the effect of fine-tuning CA-DINO with a small amount of labeled data: fine-tuned with only 10% of the labeled data, our proposed model outperforms the fully supervised system trained on all of it.
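To make the cluster-aware sampling idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes pseudo-labels from some clustering step are already available, and all names (`sample_positive_pair`, `crop_segment`, the segment length) are illustrative assumptions.

```python
import random

def crop_segment(waveform, seg_len=32000):
    """Crop a random fixed-length segment (e.g. 2 s at 16 kHz)."""
    if len(waveform) <= seg_len:
        return waveform
    start = random.randrange(len(waveform) - seg_len)
    return waveform[start:start + seg_len]

def sample_positive_pair(utterances, cluster_labels, anchor_idx):
    """Hypothetical cluster-aware positive sampling: the second positive
    segment is cropped from a *different* utterance assigned to the same
    pseudo-label cluster, rather than from the anchor utterance itself."""
    anchor_cluster = cluster_labels[anchor_idx]
    # All other utterances assigned to the same pseudo-label cluster.
    candidates = [i for i, c in enumerate(cluster_labels)
                  if c == anchor_cluster and i != anchor_idx]
    # Fall back to conventional same-utterance cropping for singleton clusters.
    partner_idx = random.choice(candidates) if candidates else anchor_idx
    return (crop_segment(utterances[anchor_idx]),
            crop_segment(utterances[partner_idx]))

# Toy usage: three "waveforms" and their pseudo-label cluster assignments.
utts = [[0.0] * 48000, [0.1] * 48000, [0.2] * 48000]
labels = [0, 0, 1]
seg_a, seg_b = sample_positive_pair(utts, labels, anchor_idx=0)  # seg_b from utts[1]
```

Compared with cropping both positives from one utterance, drawing the partner segment from a clustered neighbor reduces the chance that the two views share overlapping (and therefore speaker-irrelevant) content.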