Advancing Active Speaker Detection for Egocentric Videos

Published: 2025, Last Modified: 11 Nov 2025. ICASSP 2025. License: CC BY-SA 4.0.
Abstract: This paper presents an improved approach to multimodal active speaker detection in egocentric videos, specifically designed to be robust against the rapid movements and motion blur commonly found in such videos. We propose two key techniques to improve the model's resilience: (i) spatially fixing the lip region in the visual input, and (ii) applying motion blur augmentation. These methods significantly enhance the model's ability to handle the challenges typical of egocentric videos. We demonstrate the effectiveness of these techniques on a simple yet efficient causal audio-visual model. The proposed model, named EgoASD, achieves state-of-the-art performance on the EasyCom dataset, surpassing the previous state of the art by 1.7% mean Average Precision (mAP) with a model 2.5 times smaller. Our ablations highlight the importance of the visual input, motion blur augmentation, the pretraining method, and temporal context. To demonstrate real-world applicability, we apply our model to audio-visual speaker diarization, where it outperforms other baselines on EasyCom.
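The page does not include the authors' implementation, but the motion blur augmentation idea can be illustrated with a minimal, hypothetical PyTorch sketch. It assumes video clips arrive as (T, C, H, W) tensors of cropped lip regions; the kernel size, angle sampling, and function names are illustrative assumptions, not the paper's actual settings.

```python
# Hypothetical sketch of motion blur augmentation for video clips;
# kernel size and random angle sampling are illustrative assumptions,
# not the settings used in the paper.
import math
import torch
import torch.nn.functional as F

def motion_blur_kernel(kernel_size: int, angle_deg: float) -> torch.Tensor:
    """Build a normalized linear motion-blur kernel oriented at angle_deg."""
    k = torch.zeros(kernel_size, kernel_size)
    c = kernel_size // 2
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    # Rasterize a line through the kernel center along the blur direction.
    for t in range(-c, c + 1):
        x = int(round(c + t * dx))
        y = int(round(c + t * dy))
        if 0 <= x < kernel_size and 0 <= y < kernel_size:
            k[y, x] = 1.0
    return k / k.sum()

def apply_motion_blur(frames: torch.Tensor, kernel_size: int = 9) -> torch.Tensor:
    """Blur a (T, C, H, W) clip with one randomly oriented motion kernel."""
    t, c, h, w = frames.shape
    angle = torch.rand(1).item() * 360.0  # random blur direction per clip
    k = motion_blur_kernel(kernel_size, angle)
    # Depthwise convolution: one copy of the kernel per channel, so each
    # channel is blurred independently (groups=c).
    weight = k.view(1, 1, kernel_size, kernel_size).repeat(c, 1, 1, 1)
    return F.conv2d(frames, weight, padding=kernel_size // 2, groups=c)

# Usage: blur a 25-frame RGB lip-region clip during training.
clip = torch.rand(25, 3, 112, 112)
blurred = apply_motion_blur(clip)
```

Training on such artificially blurred clips is one plausible way to make a visual front end tolerant of the camera-motion blur that egocentric footage exhibits, which is the robustness goal the abstract describes.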