Abstract: An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and visual cues through a cross-attention module, capturing inter-modal information while retaining distinct intra-modal features. Furthermore, the system models the inter-speaker relations between speakers within the same scene. Experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
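The abstract's audio-visual fusion can be illustrated with a minimal sketch of scaled dot-product cross-attention, where each modality attends to the other and the attended features are concatenated with the originals to retain intra-modal information. All function names, dimensions, and the concatenation scheme below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats):
    """Scaled dot-product cross-attention: one modality attends to the other.

    query_feats:     (T, d) frame features of the attending modality
    key_value_feats: (T, d) frame features of the attended modality
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (T, T) similarities
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats                       # (T, d) attended feats

# Toy per-frame features (hypothetical dimensions).
rng = np.random.default_rng(0)
T, d = 10, 16
audio = rng.normal(size=(T, d))
video = rng.normal(size=(T, d))

# Each modality queries the other; concatenating the attended features with
# the original ones keeps distinct intra-modal features alongside the fused cues.
video_attends_audio = cross_attention(video, audio)
audio_attends_video = cross_attention(audio, video)
fused = np.concatenate([video, video_attends_audio,
                        audio, audio_attends_video], axis=-1)
print(fused.shape)  # (10, 64)
```

A per-frame classifier (and, per the abstract, an inter-speaker relation model over the speakers in the scene) would then operate on such fused representations.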
External IDs: dblp:conf/iccel/MocanuT24