Keywords: Computer Vision, Face Recognition, Speech Recognition, Cross-modality, Deep Learning
Abstract: The relationship between voice and face is well established in neuroscience and biology. Recent algorithmic advances have yielded substantial improvements in voice-face matching. However, these approaches predominantly succeed by leveraging datasets with diverse demographic characteristics, which inherently provide greater inter-speaker variability. We address the more challenging problem of voice-face matching and retrieval in homogeneous datasets, where all speakers share the same gender and ethnicity. Our novel deep architecture, featuring a triplet loss weighted by face distances, achieves state-of-the-art voice-face matching performance on a sequence of such homogeneous datasets. In addition, we introduce percentile-recall, a new metric for evaluating voice-face retrieval tasks.
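The abstract's central idea, a triplet loss weighted by face distances, can be illustrated with a minimal sketch. The exact weighting scheme is not specified in the abstract, so the choice below (down-weighting triplets whose positive and negative faces are far apart, i.e., easy triplets) is an assumption, as are all function and variable names.

```python
import numpy as np

def weighted_triplet_loss(voice, pos_face, neg_face, margin=0.2):
    """Hypothetical sketch: triplet loss between a voice embedding and two
    face embeddings, weighted by the distance between the faces.

    Assumption: triplets whose faces are close together (hard negatives)
    receive weight near 1, while easy triplets are down-weighted.
    """
    d_pos = np.linalg.norm(voice - pos_face)   # voice-to-matching-face distance
    d_neg = np.linalg.norm(voice - neg_face)   # voice-to-non-matching-face distance
    # Assumed weight in (0, 1], shrinking as the two faces grow apart.
    w = 1.0 / (1.0 + np.linalg.norm(pos_face - neg_face))
    # Standard margin-based triplet hinge, scaled by the face-distance weight.
    return w * max(d_pos - d_neg + margin, 0.0)
```

For example, a triplet already satisfying the margin yields zero loss regardless of the weight, while a violated triplet contributes a loss proportional to both the violation and the assumed face-similarity weight.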
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10655