Abstract: Conditional Flow Matching (CFM) models have advanced text-to-speech (TTS) synthesis, yet their efficiency and fidelity can be hampered by the uncoordinated evolution of spectral features along the generative ODE trajectory. Our discrete wavelet transform (DWT) analysis of the mel-spectrogram shows that this incoordination between the low-frequency (approximation) and high-frequency (detail) components causes later iterations to interfere with structure established earlier, and therefore demands prolonged iteration to produce faithful speech. Furthermore, we demonstrate that directly adapting existing inference-time stabilization strategies, such as those inspired by MASF for diffusion models, generalizes poorly to CFM-based TTS, owing to fundamental differences in their generative dynamics, the time-varying reliability of intermediate clean-data estimates in CFM, and potential mismatches with model-specific frequency evolution. To address these limitations, we propose a novel inference-time frequency-selective boosting strategy based on wavelet decomposition, designed to explicitly enhance and synchronize the development of distinct mel-spectrogram frequency bands during ODE solving. Our experiments show significant improvements in the faithfulness and quality of generated audio, as measured by Fréchet Audio Distance (FAD), without any degradation in Word Error Rate (WER), demonstrating a more robust and efficient path to high-quality speech synthesis with CFM models.
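To make the core operation concrete, the sketch below illustrates the kind of wavelet-based frequency-selective boosting the abstract describes: a single-level Haar DWT splits a mel-spectrogram frame into approximation (low-frequency) and detail (high-frequency) coefficients, the detail band is amplified, and the frame is reconstructed. This is a minimal illustration only; the paper's actual wavelet family, decomposition depth, boosted band, and gain schedule are not specified in the abstract, so `boost_detail` and its `gain` parameter are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    # Single-level Haar DWT: split a 1-D signal into low-frequency
    # approximation and high-frequency detail coefficients.
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    # Inverse single-level Haar DWT (perfect reconstruction).
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

def boost_detail(mel_frame, gain=1.2):
    # Hypothetical frequency-selective boost along the mel axis of one
    # frame: amplify the detail band, leave the approximation untouched.
    # In an ODE solver loop, this would be applied to each intermediate
    # estimate to re-synchronize band development.
    a, d = haar_dwt(mel_frame)
    return haar_idwt(a, gain * d)
```

With `gain=1.0` the round trip is lossless (Haar is an orthogonal transform), so the boost reduces exactly to an identity pass; any `gain > 1` scales only the high-frequency content.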
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: TTS, Wavelet decomposition, conditional flow matching
Contribution Types: Model analysis & interpretability, Surveys
Languages Studied: English, Hindi
Keywords: Conditional Flow Matching, Wavelet Decomposition, TTS
Submission Number: 7265