Parallel Neural Text-to-Speech

Kainan Peng; Wei Ping; Zhao Song; Kexin Zhao

25 Sept 2019 (modified: 22 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: text-to-speech, non-autoregressive model, parallel decoding

Abstract: In this work, we first propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains a 46.7 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively refining the attention layer by layer. Based on ParaNet, we build the first fully parallel neural text-to-speech system using parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We investigate several parallel vocoders within the TTS system, including variants of inverse autoregressive flow (IAF) vocoders and a bipartite flow vocoder.
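
The contrast the abstract draws between autoregressive and non-autoregressive synthesis can be illustrated with a toy sketch. This is not ParaNet or Deep Voice 3: the function names, the 0.5 and 0.25 coefficients, and the modulo lookup standing in for attention are all hypothetical, chosen only to show the dependency structure. In the autoregressive case each frame needs the previous frame, so the frames must be produced one after another; in the non-autoregressive case each frame depends only on the text encoding and its position, so all frames could be computed at once.

```python
from typing import List


def autoregressive_decode(text_enc: List[float], n_frames: int) -> List[float]:
    """Toy autoregressive decoder: frame t depends on frame t-1."""
    frames: List[float] = []
    prev = 0.0
    for t in range(n_frames):
        ctx = text_enc[t % len(text_enc)]  # hypothetical stand-in for attention
        prev = 0.5 * prev + ctx            # needs the previous output frame
        frames.append(prev)
    return frames


def parallel_frame(text_enc: List[float], t: int) -> float:
    """Toy non-autoregressive frame: uses only the text encoding and position t."""
    return text_enc[t % len(text_enc)] + 0.25 * t


def parallel_decode(text_enc: List[float], n_frames: int) -> List[float]:
    """No frame reads another frame, so the frames can be computed in any order."""
    return [parallel_frame(text_enc, t) for t in range(n_frames)]


if __name__ == "__main__":
    enc = [1.0, 2.0, 3.0]
    print(autoregressive_decode(enc, 4))
    print(parallel_decode(enc, 4))
```

The list comprehension in `parallel_decode` still runs sequentially in plain Python; the point is only that no frame reads another frame's output, which is the property that allows a real model to compute all frames in a single feed-forward pass.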

Community Implementations: [5 code implementations](https://www.catalyzex.com/paper/parallel-neural-text-to-speech/code)
