Abstract: Speaker diarization has been well studied for constrained scenarios but remains little explored for in-the-wild videos, which feature more speakers, shorter utterances, and speakers who are not consistently visible on screen. We address this gap by proposing an audio-visual diarization model that combines audio-only and visual-centric sub-systems via late fusion. For audio, we improve the attractor-based end-to-end system EEND-EDA with an attention mechanism and a speaker-recognition loss to handle the larger number of speakers and to preserve speaker identities across recordings. The visual-centric sub-system leverages facial attributes and lip-audio synchrony to estimate the identities and speech activity of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a wide margin, and the fused audio-visual system achieves a new SOTA on the AVA-AVD benchmark.
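To make the late-fusion step concrete, below is a minimal Python sketch of one common way to fuse frame-level speaker-activity probabilities from two diarization sub-systems. The function name, the Hungarian-matching alignment, and the weighted-average fusion rule are illustrative assumptions for exposition, not the paper's exact procedure.

```python
# Illustrative late fusion of two diarization sub-systems (not the paper's
# exact method). Assumed setup: each sub-system emits a (T, S) matrix of
# per-frame speech-activity probabilities for T frames and S speakers.
import numpy as np
from scipy.optimize import linear_sum_assignment


def late_fusion(p_audio: np.ndarray,
                p_visual: np.ndarray,
                w_audio: float = 0.5,
                threshold: float = 0.5) -> np.ndarray:
    """Fuse per-frame speaker-activity probabilities from two sub-systems.

    p_audio, p_visual: (T, S) arrays of probabilities in [0, 1].
    Returns a (T, S) binary diarization decision matrix.
    """
    # Speaker columns from the two systems are unordered, so align them
    # first: score every audio/visual speaker pair by how often the two
    # streams agree, then solve the assignment with the Hungarian algorithm.
    agreement = p_audio.T @ p_visual + (1.0 - p_audio).T @ (1.0 - p_visual)
    _, col = linear_sum_assignment(-agreement)  # negate: maximize agreement
    p_visual_aligned = p_visual[:, col]

    # Weighted average of the aligned probability streams, then threshold
    # to obtain binary speech/non-speech decisions per speaker and frame.
    fused = w_audio * p_audio + (1.0 - w_audio) * p_visual_aligned
    return (fused > threshold).astype(int)


# Usage with mock sub-system outputs: 10 frames, 3 speakers.
rng = np.random.default_rng(0)
decisions = late_fusion(rng.random((10, 3)), rng.random((10, 3)))
print(decisions)
```

In a real system the two probability streams would also need to be resampled to a common frame rate before alignment, and the fusion weight would typically be tuned on a development set.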