Abstract: Conditional Flow Matching (CFM) models have advanced text-to-speech (TTS) synthesis, yet their efficiency and fidelity can be hampered by the uncoordinated evolution of spectral features along the generative ODE trajectory. Our discrete wavelet transform (DWT) analysis of the mel-spectrogram shows that this incoordination between the low-frequency (approximation) and high-frequency (detail) components causes later iterations to interfere with structure established earlier, and therefore demands prolonged iteration to produce faithful speech. Furthermore, we demonstrate that directly adapting existing inference-time stabilization strategies, such as those inspired by MASF for diffusion models, generalizes poorly to CFM-based TTS, owing to fundamental differences in their generative dynamics, the time-varying reliability of intermediate clean-data estimates in CFM, and potential mismatches with model-specific frequency evolution. To address these limitations, we propose a novel inference-time frequency-selective boosting strategy based on wavelet decomposition, designed to explicitly enhance and synchronize the development of distinct mel-spectrogram frequency bands during ODE solving. Our experiments show significant improvements in the faithfulness and quality of generated audio, as measured by Fréchet Audio Distance (FAD), without any degradation in Word Error Rate (WER), demonstrating a more robust and efficient path to high-quality speech synthesis with CFM models.
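To make the core operation concrete, the sketch below illustrates the kind of wavelet-based frequency-selective boosting the abstract describes: a single-level Haar DWT splits a mel-spectrogram frame into approximation (low-frequency) and detail (high-frequency) coefficients, the detail band is amplified, and the frame is reconstructed. This is a minimal illustration only; the paper's actual wavelet family, decomposition depth, boosted band, and gain schedule are not specified in the abstract, so `boost_detail` and its `gain` parameter are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    # Single-level Haar DWT: split a 1-D signal into low-frequency
    # approximation and high-frequency detail coefficients.
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    # Inverse single-level Haar DWT (perfect reconstruction).
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

def boost_detail(mel_frame, gain=1.2):
    # Hypothetical frequency-selective boost along the mel axis of one
    # frame: amplify the detail band, leave the approximation untouched.
    # In an ODE solver loop, this would be applied to each intermediate
    # estimate to re-synchronize band development.
    a, d = haar_dwt(mel_frame)
    return haar_idwt(a, gain * d)
```

With `gain=1.0` the round trip is lossless (Haar is an orthogonal transform), so the boost reduces exactly to an identity pass; any `gain > 1` scales only the high-frequency content.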
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: TTS, Wavelet decomposition, conditional flow matching
Contribution Types: Model analysis & interpretability, Surveys
Languages Studied: English, Hindi
Keywords: Conditional Flow Matching, Wavelet Decomposition, TTS
Submission Number: 7265