Self-motion as supervision for egocentric audiovisual localization

Published: 19 Jan 2024, Last Modified: 16 Dec 2025. OpenReview Archive Direct Upload. License: CC BY 4.0
Abstract: Sound source localization is a key requirement for many assistive applications of augmented reality, such as speech enhancement. In conversational settings, potential sources of interest may be approximated by active speaker detection. However, localizing speakers in crowded, noisy environments is challenging, particularly without extensive ground-truth annotations. Still, people are often able to communicate effectively in these scenarios through orienting behavioral responses, such as head motion and eye gaze, which have been shown to correlate with the directions of auditory sources. In the absence of ground-truth annotations, we propose jointly training egocentric audiovisual localization with behavioral pseudolabels, relating audiovisual stimuli to directional information extracted from future behavior. We evaluate this method as a technique for unsupervised egocentric active speaker localization and compare pseudolabels derived from head and gaze directions against fully supervised alternatives.
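The core idea, supervising a spatial localizer with pseudolabels derived from the wearer's subsequent head or gaze direction, can be sketched as follows. This is a minimal illustration under assumed details, not the paper's method: the AVLocalizer module, the gaze_to_cell mapping, the grid resolution, and the cross-entropy objective are all hypothetical stand-ins introduced here for clarity.

```python
import torch
import torch.nn as nn

class AVLocalizer(nn.Module):
    """Hypothetical localization head: maps fused audiovisual
    features to logits over a coarse spatial grid of image cells."""
    def __init__(self, feat_dim=512, grid=(8, 8)):
        super().__init__()
        self.grid = grid
        self.head = nn.Linear(feat_dim, grid[0] * grid[1])

    def forward(self, av_feats):
        logits = self.head(av_feats)          # (B, H*W)
        return logits.view(-1, *self.grid)    # (B, H, W)

def gaze_to_cell(gaze_dir, grid=(8, 8)):
    """Convert a future gaze/head direction, expressed as normalized
    image coordinates in [0, 1)^2, into a grid-cell pseudolabel."""
    row = (gaze_dir[:, 1] * grid[0]).long().clamp(0, grid[0] - 1)
    col = (gaze_dir[:, 0] * grid[1]).long().clamp(0, grid[1] - 1)
    return row * grid[1] + col

# One training step: the localizer is supervised by the direction the
# wearer subsequently oriented toward, with no ground-truth speaker labels.
model = AVLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

av_feats = torch.randn(4, 512)   # stand-in fused audiovisual features
future_gaze = torch.rand(4, 2)   # stand-in future gaze directions

logits = model(av_feats).flatten(1)        # (B, H*W)
pseudolabels = gaze_to_cell(future_gaze)   # (B,)
loss = nn.functional.cross_entropy(logits, pseudolabels)

opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch the pseudolabel comes from behavior observed after the audiovisual snippet, so the model learns to associate current audiovisual evidence with where the wearer will orient next, consistent with the abstract's use of future behavior as supervision.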