Abstract: This study explores optimisation techniques for refining the articulatory parameters of the Pink Trombone, a simplified physical speech synthesiser, so that it accurately emulates male and female vocal tract characteristics in non-speech sounds. We employ black-box and grey-box approaches, using a genetic optimiser and Mel-spectrogram representations to infer articulatory configurations from human recordings via direct spectral comparison. Optimisation is performed over time windows to ensure temporal coherence, and we introduce modifications to state-of-the-art objective metrics. The grey-box strategies incorporate pYIN for fundamental frequency estimation and a ResNet-based neural network that serves as a neural codebook to enhance the optimisation process. Our findings confirm the synthesiser’s ability to replicate human vocalisations, achieving superior performance over existing techniques in subjective evaluations. We also refine the perceptual metric ViSQOL, providing a calibrated framework for future auditory assessments in physical speech synthesis. These contributions establish a methodology for articulatory parameter estimation, improving synthesis quality and broadening applications in vocalisation modelling and analysis.
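To make the black-box idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical two-parameter `toy_synth` in place of the Pink Trombone, and crude log band energies computed from a plain DFT in place of a true Mel-spectrogram. All names, bounds, and hyperparameters are illustrative assumptions. The genetic loop only ever queries the synthesiser through its output spectrum, which is what makes the search black-box.

```python
import cmath
import math
import random

SAMPLE_RATE = 8000  # Hz, toy value (assumption)
N_SAMPLES = 64      # one short analysis window (assumption)
BOUNDS = [(80.0, 400.0), (0.0, 1.0)]  # (f0 in Hz, 2nd-harmonic amplitude)


def toy_synth(params):
    """Hypothetical stand-in synthesiser: a two-harmonic tone."""
    f0, brightness = params
    return [
        math.sin(2 * math.pi * f0 * n / SAMPLE_RATE)
        + brightness * math.sin(2 * math.pi * 2 * f0 * n / SAMPLE_RATE)
        for n in range(N_SAMPLES)
    ]


def log_band_energies(signal, n_bands=8):
    """Log energy in equal-width DFT bands (a crude proxy, not a Mel filterbank)."""
    size = len(signal)
    half = size // 2
    power = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / size)
                for n, x in enumerate(signal))) ** 2
        for k in range(half)
    ]
    width = half // n_bands
    return [math.log(1e-9 + sum(power[b * width:(b + 1) * width]))
            for b in range(n_bands)]


def spectral_loss(params, target_feats):
    """Mean squared difference between candidate and target spectral features."""
    feats = log_band_energies(toy_synth(params))
    return sum((a - b) ** 2 for a, b in zip(feats, target_feats)) / len(feats)


def clip(value, lo, hi):
    return max(lo, min(hi, value))


def genetic_search(target_feats, pop_size=20, generations=15, seed=0):
    """Elitist genetic search; returns the best parameters and the loss history."""
    rng = random.Random(seed)
    population = [tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS)
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: spectral_loss(p, target_feats))
        history.append(spectral_loss(ranked[0], target_feats))
        elite = ranked[: pop_size // 4]  # survivors are carried over unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mum, dad = rng.sample(elite, 2)
            # Uniform crossover per gene, then Gaussian mutation, then clipping.
            children.append(tuple(
                clip(rng.choice(pair) + rng.gauss(0.0, 0.05 * (hi - lo)), lo, hi)
                for pair, (lo, hi) in zip(zip(mum, dad), BOUNDS)
            ))
        population = elite + children
    best = min(population, key=lambda p: spectral_loss(p, target_feats))
    history.append(spectral_loss(best, target_feats))
    return best, history


# The "recording" here is itself synthetic, so a perfect match exists.
target = log_band_energies(toy_synth((220.0, 0.5)))
best, history = genetic_search(target)
print(best, history[-1])
```

Because the elite individuals survive each generation unchanged, the best loss in `history` can never increase. A real setup would replace `toy_synth` with the synthesiser, use windowed Mel-spectrograms of human recordings as the target, and, in a grey-box variant, fix or seed some parameters from external estimates such as a fundamental frequency tracker.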
External IDs: dblp:journals/ejasmp/CamaraBR25