Abstract: Articulatory synthesis generates speech by modeling vocal tract configurations, but estimating articulatory parameters from audio, the acoustic-to-articulatory inversion (AAI) problem, remains challenging due to data scarcity, ambiguity, and the limitations of optimization-based methods. We propose PinkVocalTransformer, a Transformer framework that reformulates AAI as a sequence-to-sequence classification task over 44-dimensional vocal tract diameter sequences derived from the Pink Trombone physical synthesizer. By modeling complete tract shapes rather than higher-level articulatory trajectories, our approach yields a more interpretable and spatially consistent representation. To enable supervised learning, we generated over four million synthetic audio–parameter pairs under controlled static configurations. HuBERT embeddings improve feature extraction and robustness to real audio inputs. Reformulating regression as classification mitigates convergence issues arising from multimodal parameter distributions, leading to more stable predictions. Since ground-truth articulatory data are unavailable for real recordings, we regenerate audio from the predicted parameters to evaluate reconstruction quality indirectly. Experiments show that PinkVocalTransformer outperforms VAE-based and optimization baselines in vowel reconstruction, and objective ViSQOL metrics together with ABX listening tests confirm higher perceptual similarity and listener preference for the regenerated audio. While the model performs strongly on static and simple dynamic segments, future work will extend coverage to more diverse articulatory transitions and adapt the framework to more complex vocal tract models. Overall, this approach provides an efficient, data-driven framework for recovering interpretable articulatory parameters from audio, with improved reconstruction quality and perceptual similarity over existing baselines.
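To make the regression-as-classification reformulation concrete, the following is a minimal sketch of a frame-level classifier over discretized vocal tract diameters on top of HuBERT features. It is not the paper's implementation: the bin count, diameter range, model sizes, and all identifiers below are illustrative assumptions; only the 44-dimensional tract representation and the use of HuBERT embeddings come from the abstract.

```python
# Minimal sketch (assumptions: bin count, diameter range, and model sizes are
# illustrative, not the paper's settings). Each of the 44 vocal tract diameters
# is quantized into bins and predicted with a cross-entropy loss over
# frame-level HuBERT features, instead of regressing continuous values.
import torch
import torch.nn as nn

N_SECTIONS = 44          # vocal tract diameter dimensions (from the abstract)
N_BINS = 64              # assumption: discretization bins per diameter
HUBERT_DIM = 768         # hidden size of HuBERT-base frame embeddings
D_MIN, D_MAX = 0.0, 4.0  # assumption: diameter range used for binning


def diameters_to_bins(diam: torch.Tensor) -> torch.Tensor:
    """Quantize continuous diameters (B, T, 44) into integer class labels."""
    norm = (diam - D_MIN) / (D_MAX - D_MIN)
    return (norm * (N_BINS - 1)).round().clamp(0, N_BINS - 1).long()


class TractClassifier(nn.Module):
    """Transformer encoder over HuBERT features with one classifier per tract section."""

    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(HUBERT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Logits for every (section, bin) pair: output is (B, T, 44 * N_BINS).
        self.head = nn.Linear(d_model, N_SECTIONS * N_BINS)

    def forward(self, hubert_feats: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.proj(hubert_feats))
        logits = self.head(h)
        return logits.view(*hubert_feats.shape[:2], N_SECTIONS, N_BINS)


if __name__ == "__main__":
    model = TractClassifier()
    feats = torch.randn(2, 100, HUBERT_DIM)          # precomputed HuBERT frames
    target = torch.rand(2, 100, N_SECTIONS) * D_MAX  # stand-in ground-truth diameters
    logits = model(feats)                            # (2, 100, 44, N_BINS)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, N_BINS), diameters_to_bins(target).reshape(-1)
    )
    print(loss.item())
```

Predicting a categorical distribution per diameter lets the model represent multimodal solutions to the inversion problem; a continuous estimate can then be recovered by taking the expected bin center or the argmax per section.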