Spatial Speaker ID: Joint Spatial and Semantic Learning for Multi-Microphone Speaker Identification on Short Far-Field Utterances

28 Sept 2024 (modified: 14 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0
Keywords: Representation learning, speaker identification, multi-channel audio
TL;DR: We propose a new machine learning task in which models learn representations of multi-microphone audio that capture both spatial and voice-characteristic information on short far-field utterances.
Abstract: Speaker identification is the task of identifying a person who is currently talking by analysing microphone signals. Typical automatic speaker identification systems use a single microphone and require complete utterances of 10-30 seconds in length to accurately identify a person from an enrollment set. We introduce the related problem of detecting which person is talking among several people in a room when the utterances are very short, e.g., a single word or a short laugh. Since such utterances are too short for conventional methods, we take inspiration from the way humans solve this problem: using two ears and a joint understanding of both semantic and spatial context. To address this, we propose Spatial Speaker ID, which uses banded covariance features derived from multi-microphone input, along with conventional banded power, to identify talkers based on both the semantic characteristics and the spatial location of a sound. The internal representation in Spatial Speaker ID jointly captures spatial and voice-characteristic information and is learnt contrastively, whereby two utterances from the same talker in the same location are required to have similar embeddings. We train a downstream binary classifier that determines whether two sets of embeddings come from the same talker in the same location. Using this binary classifier, we compare multiple ways of presenting the microphone covariance features to the upstream models. We show the importance of spatial information for identifying talkers on short utterances with interfering noise.
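The abstract does not specify how the banded covariance and banded power features are computed; the sketch below shows one plausible construction, averaging per-frame spatial covariance matrices over frequency bands of a multi-channel STFT. The function name, band layout, and normalisation are our own assumptions, not the paper's definitions.

```python
import numpy as np

def banded_spatial_features(stft, band_edges):
    """Illustrative banded spatial covariance + banded power features.

    stft:       complex array, shape (mics, freq_bins, frames)
    band_edges: list of (lo, hi) STFT-bin ranges defining frequency bands
    Returns:
      cov:   (bands, frames, mics, mics) band-averaged spatial covariances
      power: (bands, frames) log power averaged over mics and band bins
    """
    n_mics, n_bins, n_frames = stft.shape
    n_bands = len(band_edges)
    cov = np.zeros((n_bands, n_frames, n_mics, n_mics), dtype=complex)
    power = np.zeros((n_bands, n_frames))
    for b, (lo, hi) in enumerate(band_edges):
        band = stft[:, lo:hi, :]  # (mics, bins_in_band, frames)
        # Average the outer products x x^H over the band's bins, per frame
        cov[b] = np.einsum('mft,nft->tmn', band, band.conj()) / (hi - lo)
        # Log power pooled over microphones and band bins, per frame
        power[b] = np.log(np.mean(np.abs(band) ** 2, axis=(0, 1)) + 1e-10)
    return cov, power
```

The off-diagonal covariance terms carry inter-microphone phase and level differences (spatial cues), while the banded power summarises the spectral envelope, matching the two feature families the abstract names.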
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13242