FlashSpeech: Efficient Zero-Shot Speech Synthesis

Published: 20 Jul 2024, Last Modified: 21 Jul 2024
Venue: MM2024 Poster
License: CC BY 4.0
Abstract: Large-scale zero-shot speech synthesis has recently been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis that uses a lower computing budget while achieving quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work, focusing on efficient zero-shot speech generation systems, significantly contributes to multimedia and multimodal processing by advancing the efficiency and effectiveness of speech synthesis technology. The application of FlashSpeech in multimedia contexts is manifold, including improved real-time speech synthesis for interactive systems like virtual assistants, gaming, and augmented reality, where quick and natural speech generation enhances user experience. Moreover, the ability to generate high-quality speech rapidly supports more seamless and integrated multimodal interactions by reducing the latency between user inputs and system responses, thus fostering a more natural dialogue flow.
Submission Number: 274