Latent Insights: Exploring Phoneme Diversity in Natural and Synthetic Speech through Latent Representations

Diptasree Debnath, Helard Becerra Martinez, Andrew Hines

Published: 2025, Last Modified: 06 May 2026, SpeD 2025, CC BY-SA 4.0

Abstract: The growing use of synthetic speech highlights the need to understand its differences from natural speech. Synthetic speech offers potential advantages for data augmentation, including privacy, security, and ethical data sourcing. However, the naturalness of synthetic speech is both a limitation and a risk, with deepfakes facilitating misinformation and fraud. These challenges underscore the need for improved methods for evaluating synthetic speech quality and detecting deepfakes. This research investigates whether latent representations from self-supervised models can identify and quantify differences between natural and synthetic speech with respect to phoneme type, stress, manner, and roundedness. Our objective is to determine whether models learn all phoneme categories equally or whether certain groups present greater challenges, revealing limitations in synthetic speech. We pre-trained two wav2vec 2.0 models using matched natural and synthetic speech datasets. We mapped the learned codeword dictionaries to labeled test data with phoneme-level annotations and analysed the distribution and diversity of these latent representations across different phonemic categories. Our findings indicate a general lack of phonetic diversity in synthetic speech, with stress and manner showing the largest disparities. Vowels and diphthongs consistently exhibit reduced diversity. Identifying and quantifying these differences in latent representations can help enhance synthetic speech generation, improve classification accuracy, and support the development of robust quality metrics for synthetic speech.
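
The analysis step described in the abstract (mapping codewords to phoneme-labelled frames and comparing their diversity per category) can be illustrated with a minimal sketch. The abstract does not state which diversity measure was used, so the unique-codeword count and Shannon entropy below are assumptions chosen for illustration. The function name `codeword_diversity`, the `(category, codeword_id)` input format, and the toy data are hypothetical and are not results from the paper.

```python
import math
from collections import Counter, defaultdict


def codeword_diversity(frames):
    """Summarise codeword usage per phonemic category.

    frames: iterable of (category, codeword_id) pairs, one per labelled frame.
    Returns {category: {"unique": int, "entropy_bits": float}}.
    Assumption: diversity is measured as the unique-codeword count and the
    Shannon entropy of the codeword distribution (not confirmed by the source).
    """
    counts = defaultdict(Counter)
    for category, codeword in frames:
        counts[category][codeword] += 1

    stats = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        # Shannon entropy (in bits) of the empirical codeword distribution.
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counter.values()
        )
        stats[category] = {"unique": len(counter), "entropy_bits": entropy}
    return stats


# Made-up toy frames, for illustration only.
natural = [("vowel", 1), ("vowel", 2), ("vowel", 3), ("vowel", 4),
           ("stop", 7), ("stop", 8)]
synthetic = [("vowel", 1), ("vowel", 1), ("vowel", 2), ("vowel", 2),
             ("stop", 7), ("stop", 8)]

natural_stats = codeword_diversity(natural)
synthetic_stats = codeword_diversity(synthetic)

for category in sorted(natural_stats):
    print(category, natural_stats[category], synthetic_stats[category])
```

In this toy input the synthetic vowel frames reuse fewer codewords than the natural ones, so both the unique count and the entropy are lower for that category. This is the kind of per-category gap the abstract refers to as reduced diversity.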

External IDs: dblp:conf/sped/DebnathMH25