Keywords: Text-to-Speech, Zero-shot Speech Synthesis, Discrete Flow Matching, Low-Latency Speech Synthesis, Non-Autoregressive
TL;DR: DiFlow-TTS is the first zero-shot TTS framework to use discrete flow matching (DFM) to learn probability flows in the discrete space of factorized codec tokens, achieving natural speech with accurate prosody and speaker cloning at up to 11.7× smaller model size and 34× faster inference.
Abstract: Although flow matching and diffusion models have emerged as powerful generative paradigms that advance zero-shot text-to-speech (TTS) systems in continuous settings, they still fall short of capturing high-quality speech attributes such as naturalness, similarity, and prosody. A key reason for this limitation is that continuous representations often entangle these attributes, making fine-grained control and generation more difficult. Discrete codec representations offer a promising alternative, yet most flow-based methods embed tokens into a continuous space before applying flow matching, diminishing the benefits of discrete data. In this work, we present DiFlow-TTS, which, to the best of our knowledge, is the first model to apply discrete flow matching directly to discrete inputs for generating high-quality speech. Leveraging factorized speech attributes, DiFlow-TTS introduces a factorized flow prediction mechanism that simultaneously predicts prosody and acoustic detail through separate heads, enabling explicit modeling of aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS delivers strong performance across several metrics, while maintaining a compact model size (up to 11.7 times smaller) and low-latency inference that generates speech up to 34 times faster than recent state-of-the-art baselines. Code and audio samples are available on our demo page: https://diflow-tts.github.io
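To make the factorized flow prediction idea concrete, the sketch below shows one plausible reading of the abstract: a shared backbone with separate heads that predict aspect-specific token distributions for prosody and acoustic-detail codec streams, sampled with a mask-based discrete flow matching path. All module names, sizes, and the linear masking schedule are illustrative assumptions, not the authors' actual architecture or probability path.

```python
# Minimal sketch (PyTorch) of factorized flow prediction with a mask-based
# discrete flow matching sampler. Hyperparameters and schedule are assumptions.
import torch
import torch.nn as nn


class FactorizedFlowPredictor(nn.Module):
    def __init__(self, vocab_prosody=1024, vocab_acoustic=1024, d_model=256, n_layers=4):
        super().__init__()
        self.mask_id_p = vocab_prosody          # extra "mask" token id per stream
        self.mask_id_a = vocab_acoustic
        self.emb_p = nn.Embedding(vocab_prosody + 1, d_model)
        self.emb_a = nn.Embedding(vocab_acoustic + 1, d_model)
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Separate heads model aspect-specific distributions (factorized prediction).
        self.head_prosody = nn.Linear(d_model, vocab_prosody)
        self.head_acoustic = nn.Linear(d_model, vocab_acoustic)

    def forward(self, xt_p, xt_a, t):
        # xt_p, xt_a: (B, T) partially masked token streams; t: (B,) flow time in [0, 1]
        h = self.emb_p(xt_p) + self.emb_a(xt_a) + self.time_mlp(t[:, None])[:, None, :]
        h = self.backbone(h)
        return self.head_prosody(h), self.head_acoustic(h)  # logits over each vocab


@torch.no_grad()
def dfm_sample(model, B=1, T=50, steps=8, device="cpu"):
    """Mask-based discrete flow sampler with a linear schedule kappa(t) = t:
    at each step a fraction of still-masked positions is committed to a sample
    from the predicted clean-token distribution."""
    xp = torch.full((B, T), model.mask_id_p, dtype=torch.long, device=device)
    xa = torch.full((B, T), model.mask_id_a, dtype=torch.long, device=device)
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        logits_p, logits_a = model(xp, xa, t.expand(B))
        # Probability of unmasking a still-masked position during this step.
        p_unmask = ((t_next - t) / (1.0 - t + 1e-8)).clamp(max=1.0)
        for x, logits, mask_id in ((xp, logits_p, model.mask_id_p),
                                   (xa, logits_a, model.mask_id_a)):
            masked = x == mask_id
            unmask = masked & (torch.rand_like(x, dtype=torch.float) < p_unmask)
            samples = torch.distributions.Categorical(logits=logits).sample()
            x[unmask] = samples[unmask]  # in-place: commits tokens for this stream
    return xp, xa


model = FactorizedFlowPredictor()
prosody_tokens, acoustic_tokens = dfm_sample(model, steps=8)
print(prosody_tokens.shape, acoustic_tokens.shape)  # torch.Size([1, 50]) for each stream
```

Because the sampler unmasks all positions within a fixed, small number of steps, generation cost is independent of sequence-length autoregression, which is consistent with the low-latency, non-autoregressive claim in the TL;DR.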
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13864