RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

Published: 15 Jun 2021, Last Modified: 05 May 2023 | INNF+ 2021 poster | Readers: Everyone
Keywords: text to speech, audio synthesis, TTS, normalizing flows
Abstract: This work introduces a predominantly parallel, end-to-end TTS model based on normalizing flows. It extends prior parallel approaches by additionally modeling speech rhythm as a separate generative distribution, which allows token durations to vary during inference. We further propose a robust framework for the online extraction of speech-text alignments, a critical yet highly unstable learning problem in end-to-end TTS frameworks. Our experiments demonstrate that the proposed techniques yield improved alignment quality and better output diversity compared to controlled baselines.