Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

TMLR Paper 3482 Authors

12 Oct 2024 (modified: 26 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: This work proposes a novel setup in which a neural network is trained to predict multiple steps of the reverse diffusion process in an unrolled manner, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively denoises its input during the reverse process until the final layer estimates the original input, $x_0$. Additionally, we introduce a new learning target based on intermediate latent variables, rather than the conventional approach of predicting the original input $x_0$ or the source noise $\epsilon_0$. In speech synthesis, using $x_0$ or $\epsilon_0$ as the target often leads to large prediction errors in the early stages of the denoising process, causing distortion in the recovered speech. Our method mitigates this issue and, in extensive evaluations, generates high-fidelity speech in competitive inference time, outperforming current state-of-the-art techniques. Moreover, the proposed approach generalizes well to unseen speech. Sample audio is available at \url{https://onexpeters.github.io/UDPNet/}.
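The following is a minimal PyTorch sketch of the unrolled setup the abstract describes: one layer per equally spaced reverse-diffusion step, each layer refining the latent until the final layer estimates $x_0$, with the intermediate latents serving as per-layer training targets. All names, layer internals, and hyperparameters below are hypothetical illustrations based only on the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenoiseLayer(nn.Module):
    """One unrolled reverse-diffusion step (illustrative residual MLP)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual refinement toward a cleaner latent

class UnrolledDenoiser(nn.Module):
    """Stack of K layers; layer k maps the latent at step t_k toward step t_{k-1}."""
    def __init__(self, dim: int, num_steps: int):
        super().__init__()
        self.layers = nn.ModuleList([DenoiseLayer(dim) for _ in range(num_steps)])

    def forward(self, x_T: torch.Tensor) -> list[torch.Tensor]:
        latents, x = [], x_T
        for layer in self.layers:   # successive layers = successive diffusion steps
            x = layer(x)
            latents.append(x)       # intermediate latent estimates
        return latents              # latents[-1] is the x_0 estimate

# Training sketch: supervise every layer with the matching latent x_{t_k}
# obtained from the forward diffusion of clean speech (one plausible reading
# of the abstract's latent-variable target). Random tensors stand in for
# real mel-spectrogram data here.
model = UnrolledDenoiser(dim=80, num_steps=8)       # e.g. 80-bin mel frames
x_T = torch.randn(4, 80)                            # a batch of noisy inputs
targets = [torch.randn(4, 80) for _ in range(8)]    # placeholder latent targets
preds = model(x_T)
loss = sum(nn.functional.mse_loss(p, t) for p, t in zip(preds, targets))
loss.backward()
```

Under these assumptions, supervising each layer with a latent $x_{t_k}$, rather than asking every step to predict $x_0$ or $\epsilon_0$ directly, keeps early-step targets close to the layer's input and may avoid the large early-stage prediction errors the abstract identifies.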
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Xu_Tan1
Submission Number: 3482