Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

TMLR Paper 3482 Authors

12 Oct 2024 (modified: 26 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: This work proposes a novel setup in which a neural network is trained to predict multiple steps of the reverse diffusion process in an unrolled manner, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively denoises its input during the reverse process until the final layer estimates the original input, $x_0$. Additionally, we introduce a new learning target based on intermediate latent variables, rather than the conventional approach of predicting the original input $x_0$ or the source noise $\epsilon_0$. In speech synthesis, using $x_0$ or $\epsilon_0$ as the target often leads to large prediction errors in the early stages of the denoising process, causing distortion in the recovered speech. Our method mitigates this issue and, in extensive evaluations, generates high-fidelity speech in competitive inference time, outperforming current state-of-the-art techniques. Moreover, the proposed approach generalizes well to unseen speech. Sample audio is available at \url{https://onexpeters.github.io/UDPNet/}.
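The following is a minimal PyTorch sketch of the unrolled setup the abstract describes: one layer per equally spaced reverse-diffusion step, each layer refining the latent until the final layer estimates $x_0$, with the intermediate latents serving as per-layer training targets. All names, layer internals, and hyperparameters below are hypothetical illustrations based only on the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenoiseLayer(nn.Module):
    """One unrolled reverse-diffusion step (illustrative residual MLP)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual refinement toward a cleaner latent

class UnrolledDenoiser(nn.Module):
    """Stack of K layers; layer k maps the latent at step t_k toward step t_{k-1}."""
    def __init__(self, dim: int, num_steps: int):
        super().__init__()
        self.layers = nn.ModuleList([DenoiseLayer(dim) for _ in range(num_steps)])

    def forward(self, x_T: torch.Tensor) -> list[torch.Tensor]:
        latents, x = [], x_T
        for layer in self.layers:   # successive layers = successive diffusion steps
            x = layer(x)
            latents.append(x)       # intermediate latent estimates
        return latents              # latents[-1] is the x_0 estimate

# Training sketch: supervise every layer with the matching latent x_{t_k}
# obtained from the forward diffusion of clean speech (one plausible reading
# of the abstract's latent-variable target). Random tensors stand in for
# real mel-spectrogram data here.
model = UnrolledDenoiser(dim=80, num_steps=8)       # e.g. 80-bin mel frames
x_T = torch.randn(4, 80)                            # a batch of noisy inputs
targets = [torch.randn(4, 80) for _ in range(8)]    # placeholder latent targets
preds = model(x_T)
loss = sum(nn.functional.mse_loss(p, t) for p, t in zip(preds, targets))
loss.backward()
```

Under these assumptions, supervising each layer with a latent $x_{t_k}$, rather than asking every step to predict $x_0$ or $\epsilon_0$ directly, keeps early-step targets close to the layer's input and may avoid the large early-stage prediction errors the abstract identifies.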
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xu_Tan1
Submission Number: 3482