Speeding Up Speech Synthesis in Diffusion Models by Reducing Data Distribution Recovery Steps via Content Transfer
Abstract: Diffusion-based vocoders have been criticised for being slow due to the many steps required
during sampling. Moreover, the commonly implemented loss function is designed such that the
target is the original input x0 or the error ε0. For early time steps of the reverse process,
this results in large prediction errors, which can lead to speech distortions and increase the
learning time. We propose a setup in which the targets are the outputs of the individual forward-process
time steps, with the goal of reducing the magnitude of prediction errors and shortening
training time. We use the layers of a neural network (NN) to perform denoising by training
each layer to generate representations that match the noised outputs of the diffusion's forward
process. The NN layers progressively denoise the input in the reverse process until the final
layer estimates the clean speech. To avoid a 1:1 mapping between NN layers and forward-process
steps, we define a skip parameter τ > 1 such that each NN layer is trained to cumulatively remove the
noise injected over τ forward-process steps. This significantly reduces the number
of data distribution recovery steps and, consequently, the time needed to generate speech. Through
extensive evaluation, we show that the proposed technique generates high-fidelity speech in
competitive time and outperforms current state-of-the-art tools. The proposed technique
also generalizes well to unseen speech.
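
To make the layer-wise denoising with a skip parameter more concrete, the following is a minimal sketch of the idea described in the abstract. The architecture, noise schedule, and all names (SkipDenoiser, q_sample, tau) are illustrative assumptions, not the authors' implementation: each layer k is trained so its output matches the forward-process sample after T − (k + 1)·τ noising steps, with the final layer targeting the clean speech x0.

```python
import torch
import torch.nn as nn

# Illustrative sketch only; names, shapes, and the noise schedule are
# assumptions for demonstration, not the paper's implementation.

T = 1000                 # number of forward diffusion steps
tau = 100                # skip parameter: each NN layer undoes tau steps
L = T // tau             # number of denoising layers

betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative product of (1 - beta_t)

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1)
    return a * x0 + s * eps

class SkipDenoiser(nn.Module):
    """Stack of L layers; layer k is trained so its output matches the
    forward-process sample after T - (k + 1) * tau steps, so the last
    layer approximates the clean speech x_0."""
    def __init__(self, dim, layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(layers)
        ])

    def forward(self, x_T):
        estimates = []
        h = x_T
        for layer in self.layers:
            h = layer(h)
            estimates.append(h)        # one estimate per tau-step block
        return estimates

def training_loss(model, x0):
    """Each layer's target is the noised signal at the corresponding forward
    step, rather than x_0 or the noise eps directly."""
    batch = x0.size(0)
    eps = torch.randn_like(x0)
    t_T = torch.full((batch,), T - 1, dtype=torch.long)
    estimates = model(q_sample(x0, t_T, eps))
    loss = x0.new_zeros(())
    for k, est in enumerate(estimates):
        t = T - (k + 1) * tau          # layer k removes the noise of tau steps
        if t > 0:
            target = q_sample(x0, torch.full((batch,), t - 1, dtype=torch.long), eps)
        else:
            target = x0                # final layer targets clean speech
        loss = loss + torch.mean((est - target) ** 2)
    return loss

# Usage sketch: toy 16-dimensional "frames", batch of 8
model = SkipDenoiser(dim=16, layers=L)
x0 = torch.randn(8, 16)
print(training_loss(model, x0))
```

Under this reading, each layer replaces τ reverse steps at inference time, so only T/τ layer evaluations are needed to recover the data distribution instead of T sampling steps.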
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have introduced structure to the paper and corrected errors in the equations. We have also evaluated the proposed technique on new metrics.
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 1666