Char2Wav: End-to-End Speech Synthesis

Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio

Feb 17, 2017 (modified: Apr 16, 2017) · ICLR 2017 workshop submission
  • Abstract: We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention: the encoder is a bidirectional recurrent neural network that accepts text or phonemes as input, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from these intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
  • TL;DR: Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
  • Keywords: Speech, Deep learning, Applications
  • Conflicts: umontreal.ca, inrs.ca, iitk.ac.in
  • Authorids: rdz.sotelo@gmail.com, soroush.mehris@umontreal.ca, kundankumar2510@gmail.com, kastnerkyle@gmail.com, jfsantos@emt.inrs.ca, aaron.courville@umontreal.ca, yoshua.bengio@umontreal.ca
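The reader described in the abstract (bidirectional RNN encoder over characters, attention-based RNN decoder emitting acoustic feature frames) can be sketched in plain numpy. This is a minimal illustrative toy, not the paper's implementation: the class name, the tanh-RNN cells, the additive-attention form, and all dimensions are assumptions chosen for brevity (the paper's reader uses different cells and sizes).

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape, scale=0.1):
    # small random weights; a stand-in for trained parameters
    return rng.standard_normal(shape) * scale

class Char2WavReaderSketch:
    """Toy encoder-decoder with additive attention: maps a character-id
    sequence to a sequence of acoustic feature frames. All cell choices
    and dimensions are illustrative assumptions, not the paper's."""

    def __init__(self, vocab=30, emb=8, enc=16, dec=16, feat=4):
        self.E = init((vocab, emb))                    # character embeddings
        # forward / backward tanh-RNN encoder weights
        self.Wxf, self.Whf = init((emb, enc)), init((enc, enc))
        self.Wxb, self.Whb = init((emb, enc)), init((enc, enc))
        # decoder tanh-RNN: input is [previous frame; attention context]
        self.Wxd = init((feat + 2 * enc, dec))
        self.Whd = init((dec, dec))
        # additive attention parameters
        self.Wa, self.Ua, self.va = init((dec, dec)), init((2 * enc, dec)), init((dec,))
        # projection from [decoder state; context] to acoustic features
        self.Wo = init((dec + 2 * enc, feat))
        self.enc_dim, self.dec_dim, self.feat = enc, dec, feat

    def encode(self, ids):
        x = self.E[ids]                                # (T, emb)
        T = len(ids)
        hf, hb = np.zeros(self.enc_dim), np.zeros(self.enc_dim)
        fwd, bwd = [], [None] * T
        for t in range(T):                             # forward pass
            hf = np.tanh(x[t] @ self.Wxf + hf @ self.Whf)
            fwd.append(hf)
        for t in reversed(range(T)):                   # backward pass
            hb = np.tanh(x[t] @ self.Wxb + hb @ self.Whb)
            bwd[t] = hb
        # concatenate directions: one annotation per input character
        return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)

    def attend(self, s, H):
        # additive attention: score each encoder annotation, softmax, mix
        scores = np.tanh(s @ self.Wa + H @ self.Ua) @ self.va
        a = np.exp(scores - scores.max())
        a /= a.sum()
        return a @ H                                   # context vector

    def decode(self, H, n_frames):
        s = np.zeros(self.dec_dim)
        frame = np.zeros(self.feat)
        out = []
        for _ in range(n_frames):
            c = self.attend(s, H)
            s = np.tanh(np.concatenate([frame, c]) @ self.Wxd + s @ self.Whd)
            frame = np.concatenate([s, c]) @ self.Wo   # next acoustic frame
            out.append(frame)
        return np.stack(out)                           # (n_frames, feat)

reader = Char2WavReaderSketch()
H = reader.encode(np.array([3, 7, 1, 9, 2]))           # 5 input characters
frames = reader.decode(H, n_frames=6)
print(frames.shape)  # (6, 4)
```

In the full system these feature frames would condition the SampleRNN vocoder, which autoregressively generates raw waveform samples; here untrained weights simply demonstrate the data flow.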