Abstract: An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and visual cues through a cross-attention module, capturing inter-modal information while retaining distinct intra-modal features. Furthermore, the system models the inter-speaker relations between speakers within the same scene. Experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
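The abstract's audio-visual fusion can be illustrated with a minimal sketch of scaled dot-product cross-attention, where each modality attends to the other and the attended features are concatenated with the originals to retain intra-modal information. All function names, dimensions, and the concatenation scheme below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats):
    """Scaled dot-product cross-attention: one modality attends to the other.

    query_feats:     (T, d) frame features of the attending modality
    key_value_feats: (T, d) frame features of the attended modality
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (T, T) similarities
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats                       # (T, d) attended feats

# Toy per-frame features (hypothetical dimensions).
rng = np.random.default_rng(0)
T, d = 10, 16
audio = rng.normal(size=(T, d))
video = rng.normal(size=(T, d))

# Each modality queries the other; concatenating the attended features with
# the original ones keeps distinct intra-modal features alongside the fused cues.
video_attends_audio = cross_attention(video, audio)
audio_attends_video = cross_attention(audio, video)
fused = np.concatenate([video, video_attends_audio,
                        audio, audio_attends_video], axis=-1)
print(fused.shape)  # (10, 64)
```

A per-frame classifier (and, per the abstract, an inter-speaker relation model over the speakers in the scene) would then operate on such fused representations.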
External IDs: dblp:conf/iccel/MocanuT24