LatentSpeech: Latent Diffusion for Text-To-Speech Generation

Published: 2025 · Last Modified: 23 Jan 2026 · RO-MAN 2025 · CC BY-SA 4.0
Abstract: Text-To-Speech (TTS) generation plays a crucial role in human-robot interaction by allowing robots to communicate naturally with humans. Researchers have developed various TTS models to enhance speech generation. More recently, diffusion models have emerged as a powerful generative framework, achieving state-of-the-art performance in tasks such as image and video generation. However, their application in TTS has been limited by slow inference caused by the iterative denoising process. Previous work has applied diffusion models to Mel-spectrograms, relying on an additional vocoder to convert them into waveforms. To address these limitations, we propose LatentSpeech, a novel diffusion-based TTS framework that operates directly in a latent space. This space is significantly more compact and information-rich than raw Mel-spectrograms. Furthermore, we introduce an alternative latent space based on Pseudo-Quadrature Mirror Filters (PQMF), which decompose speech into multiple subbands. By leveraging PQMF's near-perfect waveform reconstruction capability, LatentSpeech eliminates the need for a separate vocoder and reduces both model size and inference time. Our PQMF-based LatentSpeech model reduces inference time by 45% and model size by 77% compared to Mel-spectrogram diffusion models. On benchmark datasets, it achieves a 25% lower word error rate (WER) and a 58% higher mean opinion score (MOS) using the same training data. These results highlight LatentSpeech as an efficient, high-quality TTS solution for real-time and human-robot interaction. Code and models are available here.
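To make the PQMF idea concrete, below is a minimal sketch of PQMF analysis and synthesis using the common cosine-modulated Kaiser-window design found in multi-band vocoder work. The class name, filter parameters (taps, cutoff, beta), and the round-trip check are illustrative assumptions, not the paper's actual configuration; they only demonstrate why near-perfect subband reconstruction lets the diffusion model's output be converted back to a waveform without a learned vocoder.

```python
# Hypothetical PQMF sketch (NumPy/SciPy); parameters are assumptions,
# not the configuration used by LatentSpeech.
import numpy as np
from scipy.signal import firwin, lfilter


class PQMF:
    def __init__(self, subbands: int = 4, taps: int = 62,
                 cutoff: float = 0.142, beta: float = 9.0):
        # Prototype lowpass filter, designed with a Kaiser window.
        h_proto = firwin(taps + 1, cutoff, window=("kaiser", beta))
        n = np.arange(taps + 1)
        self.subbands = subbands
        self.analysis = np.zeros((subbands, taps + 1))
        self.synthesis = np.zeros((subbands, taps + 1))
        for k in range(subbands):
            # Cosine modulation shifts the prototype to the k-th subband;
            # the +/- pi/4 phase pattern cancels adjacent-band aliasing.
            arg = (2 * k + 1) * (np.pi / (2 * subbands)) * (n - taps / 2)
            phase = (-1) ** k * np.pi / 4
            self.analysis[k] = 2 * h_proto * np.cos(arg + phase)
            self.synthesis[k] = 2 * h_proto * np.cos(arg - phase)

    def analyze(self, x: np.ndarray) -> np.ndarray:
        # Filter then decimate: (T,) waveform -> (subbands, T // subbands).
        bands = np.stack([lfilter(h, [1.0], x) for h in self.analysis])
        return bands[:, :: self.subbands]

    def synthesize(self, bands: np.ndarray) -> np.ndarray:
        # Zero-stuff each subband back to full rate, filter, and sum.
        T = bands.shape[1] * self.subbands
        up = np.zeros((self.subbands, T))
        up[:, :: self.subbands] = bands * self.subbands
        return sum(lfilter(h, [1.0], u) for h, u in zip(self.synthesis, up))


# Round-trip check: reconstruction should be close to the input,
# up to the filter bank's group delay.
pqmf = PQMF(subbands=4)
x = np.random.randn(16000)
x_hat = pqmf.synthesize(pqmf.analyze(x))
delay = pqmf.analysis.shape[1] - 1  # total analysis + synthesis delay
print(f"max reconstruction error: {np.abs(x[:-delay] - x_hat[delay:]).max():.4f}")
```

In this sketch the subband representation is the compact latent space: a diffusion model would denoise the `(subbands, T // subbands)` tensor, and `synthesize` alone recovers the waveform, which is what removes the separate vocoder from the pipeline.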