Abstract: Speaker diarization is the task of partitioning an audio stream into segments according to speaker identity. End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) can handle overlapping speech and has shown promising performance compared to traditional methods. However, EEND-EDA often fails to estimate the number of speakers accurately. To address this limitation, we first replace the Transformer encoder in EEND-EDA with a Branchformer encoder. In addition, we introduce a speaker-wise VAD loss (SAD loss) into the self-attention mechanism of the Branchformer encoder, improving the model's ability to distinguish between speakers. Extensive experiments on the Mini-LibriSpeech and simulated Sim2spk benchmark datasets show that our approach outperforms strong existing baselines by a substantial margin, achieving a significant improvement of more than 15% in diarization error rate (DER). We will release the source code on GitHub for future research.