Abstract: Recent works suggest that decoupling the information of non-target speakers from that of the target speaker in knowledge distillation (KD), and subsequently emphasizing the former, can lead to significant performance improvements. However, a well-trained teacher model typically produces near-zero non-target speaker posteriors, which contribute little to knowledge transfer and thus make KD less effective. To address this problem, we advocate a dual-group knowledge distillation framework, in which the primary group, formed by the top-k speaker posteriors, captures most of the speaker discrimination knowledge in an utterance. The non-primary group contributes to KD through a binary distillation between the primary and non-primary groups. In addition, adaptive logit softening is proposed to adjust the teacher's and student's logits in the binary distillation, further facilitating effective knowledge transfer. The proposed method, trained with a simple x-vector pipeline, obtains impressive equal error rates of 1.46%, 1.47%, and 2.70% on the three VoxCeleb1 test sets, outperforming state-of-the-art methods by a noticeable margin.
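To make the dual-group idea concrete, below is a minimal PyTorch sketch of a loss with the structure the abstract describes: KD over the teacher's top-k (primary) speaker posteriors, plus a binary distillation between the primary and non-primary groups with a sample-adaptive temperature. The function name, the log-sum-exp grouping, and the confidence-gap softening rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a dual-group KD loss; shapes, grouping, and the
# adaptive-softening rule are assumptions for illustration only.
import torch
import torch.nn.functional as F


def dual_group_kd_loss(student_logits, teacher_logits, k=100, tau=4.0):
    """student_logits, teacher_logits: (batch, num_speakers) raw logits."""
    # Primary group: indices of the teacher's top-k speaker posteriors.
    _, topk_idx = teacher_logits.topk(k, dim=-1)

    # Gather primary-group logits for teacher and student over the same indices.
    t_primary = teacher_logits.gather(-1, topk_idx)
    s_primary = student_logits.gather(-1, topk_idx)

    # Standard temperature-scaled KD (KL divergence) restricted to the primary group.
    primary_loss = F.kl_div(
        F.log_softmax(s_primary / tau, dim=-1),
        F.softmax(t_primary / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    # Non-primary contribution: collapse each group to one logit via log-sum-exp,
    # giving a binary primary-vs-non-primary distribution to distill.
    mask = torch.zeros_like(teacher_logits, dtype=torch.bool).scatter_(-1, topk_idx, True)

    def group_logits(logits):
        primary = logits.masked_fill(~mask, float("-inf")).logsumexp(-1, keepdim=True)
        non_primary = logits.masked_fill(mask, float("-inf")).logsumexp(-1, keepdim=True)
        return torch.cat([primary, non_primary], dim=-1)  # (batch, 2)

    t_bin, s_bin = group_logits(teacher_logits), group_logits(student_logits)

    # Adaptive logit softening (illustrative): a per-sample temperature that grows
    # with the teacher's confidence gap, softening otherwise near-one-hot targets.
    adaptive_tau = 1.0 + (t_bin[:, :1] - t_bin[:, 1:]).abs()
    binary_loss = F.kl_div(
        F.log_softmax(s_bin / adaptive_tau, dim=-1),
        F.softmax(t_bin / adaptive_tau, dim=-1),
        reduction="batchmean",
    )

    return primary_loss + binary_loss
```

In this sketch the binary term is what lets the near-zero non-primary posteriors still carry gradient signal, since they are aggregated into a single group logit rather than matched speaker by speaker.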
External IDs: dblp:conf/icassp/GanTJML25