Alignment Losses for End-to-End Speaker Diarization

Simeng Shi, Zhida Song, Zhihua Fang, Xiaochen Guo, Liang He

Published: 2025, Last Modified: 27 Jan 2026. Venue: ICIC (17) 2025. License: CC BY-SA 4.0

Abstract: End-to-end neural diarization (EEND) is an effective approach to handling speaker overlap in speaker diarization: it directly predicts the posterior probability of speech activity for each speaker from the input audio features. Self-attention EEND (SA-EEND) has received significant attention and has laid a robust foundation for the development of EEND systems in recent years. To further improve training, we introduce feature alignment into SA-EEND and propose two loss functions: the audio encoding alignment loss and the speaker activity posterior alignment loss. These losses constrain the encoder and the decoder, respectively, to retain more audio information, thereby improving performance. We performed a series of experiments on simulated datasets and the CALLHOME dataset for two-speaker diarization, and observed improvements in the diarization error rate.
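
As an illustration of the quantities mentioned in the abstract, the following is a minimal sketch of how per-frame, per-speaker posteriors can be thresholded into activity decisions and scored with a simplified frame-level diarization error rate. It is not the authors' implementation, and the two alignment losses are not sketched because the abstract does not define them. The function names `binarize` and `frame_der`, the 0.5 threshold, and the toy posteriors are assumptions made for this example; the scoring omits the collar and other conventions used by standard evaluation tools.

```python
from itertools import permutations


def binarize(posteriors, threshold=0.5):
    """Turn per-frame, per-speaker posteriors into 0/1 activity decisions.

    The threshold of 0.5 is an assumption chosen for illustration.
    """
    return [[1 if p >= threshold else 0 for p in frame] for frame in posteriors]


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (hypothetical helper).

    Per frame, miss + false alarm + confusion equals
    max(n_ref, n_hyp) - n_correct. The total is minimized over speaker
    permutations, because output speaker order is arbitrary, and then
    divided by the total amount of reference speech.
    """
    n_speakers = len(reference[0])
    total_ref = sum(sum(frame) for frame in reference)
    best_errors = None
    for perm in permutations(range(n_speakers)):
        errors = 0
        for ref, hyp in zip(reference, hypothesis):
            hyp_p = [hyp[j] for j in perm]  # reorder hypothesis speakers
            n_correct = sum(r & h for r, h in zip(ref, hyp_p))
            errors += max(sum(ref), sum(hyp_p)) - n_correct
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref if total_ref else 0.0


# Toy two-speaker example: four frames, the second frame is overlapped speech.
posteriors = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6], [0.1, 0.1]]
decisions = binarize(posteriors)

# Same activity with the speaker labels swapped: no error after permutation.
reference_swapped = [[0, 1], [1, 1], [1, 0], [0, 0]]
# Reference without overlap in the second frame: one false-alarm frame.
reference_no_overlap = [[1, 0], [1, 0], [0, 1], [0, 0]]

der_swapped = frame_der(reference_swapped, decisions)
der_no_overlap = frame_der(reference_no_overlap, decisions)
```

Because each speaker has an independent output, both speakers can be active in the same frame, which is how an EEND-style output represents overlap. In this toy case `der_swapped` is zero and `der_no_overlap` is one error frame out of three reference speech frames.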

External IDs: dblp:conf/icic/ShiSFGH25