Abstract: In recent years, there has been a surge in the use of pre-trained speech models as feature extractors for speaker verification (SV). To reduce model complexity, researchers transfer knowledge from a pre-trained model to a lightweight student model, enabling the latter to reach a performance level not attainable by conventional methods. However, due to the difference in model capacity, the student features contain more noise. This results in discrepancies between the teacher and student features at the intermediate layers, negatively impacting feature-level knowledge distillation (KD). To address this issue, we employ a diffusion model to denoise the student features for KD (DenoKD), enabling more effective feature-level distillation. With a small ECAPA-TDNN as the student, our method achieves a 13% improvement over the baseline on the VoxCeleb1-O test set. Furthermore, the DenoKD mechanism is found to be effective for SV on short test utterances.
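The abstract only outlines the mechanism, so the following is a minimal PyTorch sketch of the idea as stated: treat the student's intermediate features as a noisy view of the teacher's, denoise them with a small diffusion-style network, and apply the feature-level KD loss to the denoised features. The denoiser architecture, the single-step reverse process, the feature dimensions, and the MSE objective are all illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch of diffusion-based feature denoising for KD (DenoKD-style).
# Module sizes, noise schedule, and loss are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Tiny timestep-conditioned denoiser: predicts the noise component of a
    feature map (a stand-in for the paper's diffusion model)."""
    def __init__(self, dim=256, steps=50):
        super().__init__()
        self.t_embed = nn.Embedding(steps, dim)
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x, t):
        # x: (batch, frames, dim) noisy student features; t: (batch,) timesteps
        return self.net(x + self.t_embed(t).unsqueeze(1))

def denoise_kd_loss(student_feat, teacher_feat, denoiser, steps=50):
    """Denoise the student's intermediate features, then distill against the
    teacher's features at the same layer."""
    b = student_feat.size(0)
    t = torch.randint(0, steps, (b,), device=student_feat.device)
    # One reverse step: subtract the predicted noise to get "cleaned" features.
    denoised = student_feat - denoiser(student_feat, t)
    # Feature-level KD on the denoised features (MSE is an assumption).
    return F.mse_loss(denoised, teacher_feat)

# Usage with dummy tensors standing in for intermediate layer outputs.
denoiser = FeatureDenoiser(dim=256)
student_feat = torch.randn(8, 100, 256)  # small ECAPA-TDNN student layer output
teacher_feat = torch.randn(8, 100, 256)  # pre-trained teacher layer output
loss = denoise_kd_loss(student_feat, teacher_feat, denoiser)
loss.backward()
```

In a full training loop this denoising loss would presumably be combined with the usual SV classification objective; that weighting is not specified in the abstract.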