Learning Relationship Between Speaker Embeddings and Descriptions of Speaker Traits

Xuechen Liu, Junichi Yamagishi, Xin Wang, Erica Cooper

Published: 01 Jan 2026, Last Modified: 04 Mar 2026. IEEE Transactions on Audio, Speech and Language Processing. License: CC BY-SA 4.0
Abstract: Speech perception research reveals important connections between audio signals and perceptual speaker characteristics. Addressing this intersection, this study explores the relationship between textual descriptions of perceivable speaker characteristics and speech representations by establishing a joint learning space. To this end, we construct a dataset through extensive crowd-sourced listening tests based on VoxCeleb, in which participants provided detailed evaluations of diverse speaker attributes. These evaluations are transformed into structured textual descriptions, creating paired data that captures nuanced speaker characteristics. Using this data, we extract speaker and text embeddings via corresponding pre-trained encoders. Our specialized linking networks then use contrastive learning and generative transformations to align these embeddings in a unified space. We apply the aligned embeddings to cross-modal speaker retrieval in both English and Japanese, and extend the approach to a multilingual scenario. Experimental results highlight the value of our curated dataset of listener-perceived speaker traits.
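The abstract mentions contrastive learning to align speaker and text embeddings in a unified space. The paper's actual linking networks are not reproduced here; the following is a minimal sketch of a standard symmetric (CLIP-style) contrastive objective over paired embeddings, with all function names and the temperature value being illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def contrastive_alignment_loss(speaker_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired speaker/text embeddings.

    Hypothetical sketch: illustrates the generic contrastive-alignment
    idea, not the paper's specific linking-network architecture.
    """
    # L2-normalize so dot products become cosine similarities
    s = speaker_emb / np.linalg.norm(speaker_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(s))          # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        shifted = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric loss: speaker-to-text and text-to-speaker retrieval
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each speaker embedding toward the embedding of its own textual description and pushes it away from the descriptions of other speakers, which is what enables the cross-modal retrieval evaluated in the paper.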