Exploring Branchformer-Based End-to-End Speaker Diarization with Speaker-Wise VAD Loss

Published: 01 Jan 2024 · Last Modified: 30 Jul 2025 · O-COCOSDA 2024 · CC BY-SA 4.0
Abstract: Speaker diarization involves partitioning an audio stream into segments according to speaker identity. The end-to-end neural diarization model with encoder-decoder-based attractors (EEND-EDA) can handle overlapping speech and has shown promising performance compared with traditional methods. However, EEND-EDA often fails to estimate the number of speakers accurately. To address this limitation, we first replace the Transformer encoder in EEND-EDA with a Branchformer encoder. Additionally, we introduce a speaker-wise VAD loss (SAD Loss) into the self-attention mechanism of the Branchformer encoder, improving the model's ability to distinguish different speakers. Extensive experiments on the Mini-Librispeech and simulated Sim2spk benchmark datasets show that our approach outperforms existing strong baselines by a substantial margin, improving Diarization Error Rate (DER) by more than 15%. We will release the source code on GitHub for future research.
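The abstract describes a speaker-wise VAD loss that supervises per-speaker speech activity. The paper does not spell out its exact form, but a common formulation in EEND-style models is a binary cross-entropy between predicted and reference activity, averaged per speaker so that every speaker contributes equally. The sketch below illustrates that idea; the function name, shapes, and averaging scheme are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def speaker_wise_vad_loss(probs, labels, eps=1e-7):
    """Hypothetical speaker-wise voice-activity loss (not the paper's code).

    probs:  (T, S) array of predicted speech-activity probabilities,
            one column per speaker, one row per frame.
    labels: (T, S) binary reference activity (1 = speaker active in frame).
    Returns frame-level binary cross-entropy averaged within each
    speaker first, then across speakers.
    """
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    # Frame-level BCE for every (frame, speaker) cell.
    bce = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    # Average over frames per speaker, then over speakers, so that
    # a rarely active speaker weighs as much as a dominant one.
    return bce.mean(axis=0).mean()
```

In practice such a loss would be added to the diarization objective as an auxiliary term, with the per-speaker averaging intended to sharpen the encoder's separation of individual speakers.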