Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Audiovisual, multimodal, active speaker
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We approach the ASD task in the egocentric domain following a novel strategy based on sequence-to-sequence modeling
Abstract: Current methods for Active Speaker Detection (ASD) have achieved remarkable performance on commercial movies and social media videos. However, the recent release of the Ego4D dataset has exposed the limitations of contemporary ASD methods when applied in the egocentric domain. In addition to the inherent challenges of egocentric data, egocentric video brings a novel prediction target to the ASD task, namely the camera wearer's speech activity. We propose a comprehensive approach to ASD in the egocentric domain that can model all the prediction targets (visible speakers, camera wearer, and global speech activity). Moreover, our proposal is fully instantiated inside a multimodal transformer module, thereby allowing it to operate in an end-to-end fashion over diverse modality encoders. Through extensive experimentation, we show that this flexible attention mechanism allows us to correctly model and estimate the speech activity of all visible and unseen persons in a scene. Our proposal (ASD-Mixer) achieves state-of-the-art performance on the challenging Ego4D dataset, outperforming the previous state of the art by at least 4.41%.
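As an illustrative aid only (the submission does not include implementation details), the following minimal PyTorch sketch shows one way a multimodal transformer module could jointly predict the three targets named in the abstract: visible speakers, the camera wearer, and global speech activity. All names (`ASDMixerSketch`, token layout, dimensions) are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a multimodal transformer ASD head; names and shapes are illustrative.
import torch
import torch.nn as nn


class ASDMixerSketch(nn.Module):
    """Fuses per-speaker visual embeddings with audio embeddings and predicts
    speech activity for each visible speaker, the camera wearer, and the scene."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        # Learnable query tokens for the two targets with no visible face track.
        self.wearer_token = nn.Parameter(torch.randn(1, 1, dim))
        self.global_token = nn.Parameter(torch.randn(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.mixer = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # per-token speech-activity logit

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, num_speakers, dim) from a visual encoder
        # audio_tokens:  (B, num_audio_frames, dim) from an audio encoder
        b = visual_tokens.size(0)
        tokens = torch.cat([
            self.global_token.expand(b, -1, -1),
            self.wearer_token.expand(b, -1, -1),
            visual_tokens,
            audio_tokens,
        ], dim=1)
        fused = self.mixer(tokens)
        num_speakers = visual_tokens.size(1)
        logits = self.head(fused[:, : 2 + num_speakers]).squeeze(-1)
        # logits[:, 0] -> global speech activity, logits[:, 1] -> camera wearer,
        # logits[:, 2:] -> visible speaker tracks
        return logits


# Usage with random embeddings standing in for modality-encoder outputs.
model = ASDMixerSketch()
vis = torch.randn(2, 3, 256)   # 2 clips, 3 visible speaker tracks
aud = torch.randn(2, 10, 256)  # 2 clips, 10 audio frames
print(model(vis, aud).shape)   # torch.Size([2, 5])
```

The design choice of prepending learnable query tokens is one common way to let a shared attention stack produce predictions for entities (camera wearer, whole scene) that have no corresponding visual track.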
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 902