Fast, High-Quality and Parameter-Efficient Articulatory Synthesis Using Differentiable DSP

Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli

Published: 2024, Last Modified: 06 Jan 2026, SLT 2024, CC BY-SA 4.0

Abstract: Articulatory trajectories such as electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that synthesizes speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics/noise imbalance problem and add a multiresolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, improvements of 1.63% and 0.16, respectively, over the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference and generates speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
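
For readers unfamiliar with the harmonics/noise decomposition that the abstract refers to, the following is a minimal sketch of a generic harmonic-plus-noise synthesizer, the signal model DDSP systems are commonly built on. It is not the authors' vocoder. The function name `harmonic_plus_noise`, its parameters, the fixed harmonic amplitudes, and the unfiltered uniform noise are all assumptions made for illustration; in a DDSP model the amplitudes and noise shaping would typically be time-varying and predicted by a neural network from the input features.

```python
import math
import random


def harmonic_plus_noise(f0, harmonic_amps, noise_gain, sample_rate=16000, seed=0):
    """Synthesize one sample per entry of f0 as a sum of harmonics plus noise.

    Hypothetical illustration only, not the paper's model.
    f0: per-sample fundamental frequency in Hz (0 means unvoiced).
    harmonic_amps: amplitudes for harmonics 1..K, held constant here.
    noise_gain: scale of the additive (unfiltered) noise component.
    """
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for f in f0:
        # Accumulate phase so a time-varying F0 stays continuous.
        phase += 2.0 * math.pi * f / sample_rate
        harmonic = 0.0
        if f > 0:
            for k, amp in enumerate(harmonic_amps, start=1):
                # Skip harmonics at or above Nyquist to avoid aliasing.
                if k * f < sample_rate / 2:
                    harmonic += amp * math.sin(k * phase)
        noise = noise_gain * rng.uniform(-1.0, 1.0)
        out.append(harmonic + noise)
    return out


# 10 ms of a 200 Hz voiced segment at 16 kHz (values chosen arbitrarily).
audio = harmonic_plus_noise([200.0] * 160, [0.5, 0.25, 0.125], noise_gain=0.01)
```

The relative scale of `harmonic_amps` and `noise_gain` sets the balance between the two components; the sketch fixes that balance by hand, whereas a trained vocoder has to learn it.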

External IDs: dblp:conf/slt/LiuYLWCA24