TL;DR: DMOSpeech achieves highly efficient and accurate zero-shot speech synthesis by directly optimizing a distilled diffusion model based on objective quality metrics.
Abstract: Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling, which prevents true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. Audio samples are available at https://dmospeech.github.io/demo.
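To make the metric-optimization idea concrete, here is a minimal, self-contained PyTorch sketch of the general recipe the abstract describes: a distilled generator synthesizes speech features in one differentiable pass, and frozen CTC-ASR and speaker-verification (SV) critics supply losses that backpropagate into the generator. This is an illustration, not the authors' implementation; `StudentTTS`, `FrozenASR`, `FrozenSV`, all shapes, and the 4x upsampling are invented placeholders.

```python
# Sketch of end-to-end metric optimization with CTC and SV losses.
# All module names and dimensions are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentTTS(nn.Module):
    """Stand-in for the distilled few-step generator (hypothetical)."""
    def __init__(self, vocab=64, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.proj = nn.Linear(128, mel_dim)

    def forward(self, tokens):                              # (B, S) -> (B, T, mel_dim)
        h = self.embed(tokens).repeat_interleave(4, dim=1)  # crude 4x to frame rate
        return self.proj(h)

class FrozenASR(nn.Module):
    """Stand-in for a frozen CTC recognizer; index 0 is the CTC blank."""
    def __init__(self, mel_dim=80, vocab=64):
        super().__init__()
        self.head = nn.Linear(mel_dim, vocab)

    def forward(self, mel):                                 # -> (T, B, vocab) log-probs
        return self.head(mel).log_softmax(-1).transpose(0, 1)

class FrozenSV(nn.Module):
    """Stand-in for a frozen speaker embedder."""
    def __init__(self, mel_dim=80, emb=192):
        super().__init__()
        self.pool = nn.Linear(mel_dim, emb)

    def forward(self, mel):                                 # -> (B, emb), unit norm
        return F.normalize(self.pool(mel).mean(dim=1), dim=-1)

student, asr, sv = StudentTTS(), FrozenASR(), FrozenSV()
for p in list(asr.parameters()) + list(sv.parameters()):
    p.requires_grad_(False)                                 # critics stay frozen

tokens = torch.randint(1, 64, (2, 20))                      # dummy text (0 = blank)
ref_mel = torch.randn(2, 120, 80)                           # dummy reference speech

mel = student(tokens)                                       # differentiable synthesis
log_probs = asr(mel)
T, B = log_probs.shape[:2]
ctc = F.ctc_loss(log_probs, tokens,
                 torch.full((B,), T, dtype=torch.long),
                 torch.full((B,), tokens.size(1), dtype=torch.long),
                 blank=0, zero_infinity=True)
sv_loss = 1.0 - F.cosine_similarity(sv(mel), sv(ref_mel)).mean()
(ctc + sv_loss).backward()                                  # gradients reach every student weight
```

Because nothing between the text input and the two losses is non-differentiable or iterative, gradients flow to every generator parameter, which is the property the abstract claims enables the first end-to-end metric optimization in TTS.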
Lay Summary: Our research addresses key limitations in advanced artificial speech generation, particularly in systems based on 'diffusion models.' While these models excel at producing natural-sounding voices, they are often slow, requiring many steps to generate speech, and difficult to precisely control for specific qualities like clarity or speaker resemblance. As a result, current methods struggle to efficiently create voices that are both high-fidelity and faithful to a target speaker.
DMOSpeech introduces a novel approach to overcome these challenges. We apply a technique called 'distillation' to train a much smaller, faster model from a large, complex diffusion model, drastically reducing the time it takes to generate speech. Crucially, instead of simply imitating the larger model, our method directly optimizes for measurable improvements in speech quality. We guide the model to enhance aspects like how understandable the words are and how closely the voice sounds like a specific person.
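For readers curious what distillation looks like in code, below is a minimal PyTorch sketch of one common setup: a one-step student is trained to reproduce the output of a frozen multi-step teacher. The `Teacher` and `Student` modules, the step count, and the mean-squared-error loss are generic illustrations and may differ from the exact objective used in the paper.

```python
# Generic diffusion-distillation step: all modules and numbers are
# hypothetical placeholders, not the paper's architecture.
import torch
import torch.nn as nn

class Teacher(nn.Module):
    """Stand-in for the slow multi-step diffusion sampler (hypothetical)."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    @torch.no_grad()
    def sample(self, noise, steps=32):
        x = noise
        for _ in range(steps):          # many denoising iterations
            x = x - 0.05 * self.net(x)
        return x

class Student(nn.Module):
    """One-shot generator distilled from the teacher."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, noise):           # a single forward pass
        return self.net(noise)

teacher, student = Teacher(), Student()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

noise = torch.randn(8, 80)
target = teacher.sample(noise)          # expensive teacher rollout
loss = torch.nn.functional.mse_loss(student(noise), target)
opt.zero_grad(); loss.backward(); opt.step()
```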
The impact of DMOSpeech is significant: it allows for the rapid creation of artificial voices that are not only remarkably clear but also achieve unprecedented accuracy in matching a target speaker. In many cases, listeners rate our system's voices as more similar to the target speaker than real recordings of that speaker. This breakthrough paves the way for more efficient and customizable voice synthesis applications, from more natural-sounding virtual assistants to personalized audio content.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Language, Speech and Dialog
Keywords: text-to-speech, zero-shot speech synthesis, diffusion model, diffusion distillation, metric optimization
Submission Number: 5487