Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Wei Ping; Kainan Peng; Andrew Gibiansky; Sercan O. Arik; Ajay Kannan; Sharan Narang; Jonathan Raiman; John Miller

15 Feb 2018 (modified: 22 Jun 2025) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on a single GPU server.
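To put the inference figure in perspective, the short sketch below converts ten million queries per day into an average per-second rate. It assumes load is spread uniformly over the day, which is an illustrative simplification and not a claim from the abstract; real traffic would likely peak above this average. The function name `average_queries_per_second` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_queries_per_second(queries_per_day: int) -> float:
    """Average sustained rate implied by a daily query volume.

    Assumes perfectly uniform load over 24 hours (an illustrative assumption).
    """
    return queries_per_day / SECONDS_PER_DAY


# Ten million queries per day, the figure quoted in the abstract.
rate = average_queries_per_second(10_000_000)
print(f"{rate:.1f} queries/s on average")
```

Under the uniform-load assumption this comes to roughly 116 queries per second sustained on the single GPU server.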

Keywords: 2000-Speaker Neural TTS, Monotonic Attention, Speech Synthesis

Code: [7 community implementations](https://paperswithcode.com/paper/?openreview=HJtEm4p6Z)

Data: [LibriSpeech](https://paperswithcode.com/dataset/librispeech)

Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/deep-voice-3-scaling-text-to-speech-with/code)
