A Spectral Energy Distance for Parallel Speech Synthesis
Abstract: Speech synthesis is an important practical generative modeling problem that has
seen great progress over the last few years, with likelihood-based autoregressive
neural models now outperforming traditional concatenative systems. A downside
of such autoregressive models is that they require executing tens of thousands
of sequential operations per second of generated audio, making them ill-suited
for deployment on specialized deep learning hardware. Here, we propose a new
learning method that allows us to train highly parallel models of speech, without
requiring access to an analytical likelihood function. Our approach is based on
a generalized energy distance between the distributions of the generated and real
audio. This spectral energy distance is a proper scoring rule with respect to the
distribution over magnitude-spectrograms of the generated waveform audio and
offers statistical consistency guarantees. The distance can be estimated from
minibatches without bias and does not involve adversarial learning, yielding a
stable and consistent method for training implicit generative models. Empirically,
we achieve state-of-the-art generation quality among implicit generative models, as
judged by the recently proposed cFDSD metric. When combined with adversarial
techniques, our method also improves upon the GAN-TTS model in terms of Mean
Opinion Score, as judged by trained human evaluators.
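For reference, the abstract does not spell out the loss, but a generalized energy distance between the data distribution p and the model distribution q typically takes the standard form

D^2_{\mathrm{GED}}(p, q) \;=\; 2\,\mathbb{E}\, d(x, y) \;-\; \mathbb{E}\, d(x, x') \;-\; \mathbb{E}\, d(y, y'), \qquad x, x' \sim p,\;\; y, y' \sim q \text{ i.i.d.},

where, in this setting, d would be a distance between magnitude spectrograms of the waveforms; choosing d appropriately is what makes the resulting score a proper scoring rule.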
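Since the E[d(x, x')] term does not depend on the model, a minibatch estimate of the remaining terms can serve as a training loss. The following is a minimal sketch of how such a loss could look in PyTorch; the window sizes, hop length, L1 norm on log-magnitude spectrograms, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a spectral energy distance training loss (assumed form).
import torch


def log_mag_spectrogram(wav: torch.Tensor, n_fft: int) -> torch.Tensor:
    """Log-magnitude STFT of a batch of waveforms, shape [batch, samples]."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                      window=window, return_complex=True)
    return torch.log(spec.abs() + 1e-5)  # small offset avoids log(0)


def spectrogram_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Assumed multi-scale L1 distance between log-magnitude spectrograms."""
    return sum(
        (log_mag_spectrogram(a, n) - log_mag_spectrogram(b, n)).abs().mean()
        for n in (256, 512, 1024, 2048)  # illustrative window sizes
    )


def spectral_energy_distance_loss(x_real, y_gen_a, y_gen_b):
    """Minibatch estimate of the energy distance, up to a model-independent constant.

    x_real:           real waveforms for a batch of conditioning inputs.
    y_gen_a, y_gen_b: two model samples drawn with independent noise for the
                      same conditioning inputs.
    E[d(x, x')] does not depend on model parameters and is dropped; the
    repulsive -d(y, y') term discourages collapse to a single output.
    """
    attractive = (spectrogram_distance(x_real, y_gen_a)
                  + spectrogram_distance(x_real, y_gen_b))
    repulsive = spectrogram_distance(y_gen_a, y_gen_b)
    return attractive - repulsive
```

Because every term is a plain average over independent samples, the expectation of this loss matches the parameter-dependent part of the energy distance exactly, which is the sense in which the minibatch estimate is unbiased.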