Abstract: In recent years, methods based on diffusion generative models have achieved state-of-the-art performance in voice generation. Most of these approaches are based on first-order stochastic differential equations or their equivalent diffusion models. This paper extends these first-order methods and proposes LangWave, which uses a third-order Langevin dynamical system to generate speech waveforms. LangWave simultaneously models the position, velocity, and acceleration of the voice wave during diffusion and sampling in the ambient Euclidean space. As a result, our vocoder can control the evolution from white noise to meaningful waveforms more precisely and smoothly. Experiments on the public LJSpeech dataset show significant gains in both objective and subjective evaluation, and LangWave achieves a new state-of-the-art mean opinion score (MOS) of 4.55. Audio samples are available at https://shiziqiang.github.io/langwave.