Zero-Shot Non-Autoregressive TTS Beyond Autoregressive Models Using Soft Alignment Generation and Residual Modeling
Keywords: Speech synthesis, zero-shot text-to-speech, non-autoregressive model, generative model
TL;DR: This study proposes three methods to bridge the performance gap between non-autoregressive TTS and autoregressive TTS.
Abstract: Autoregressive TTS leverages soft alignment generated by the attention mechanism, which provides the decoder with a well-designed context vector. Subsequently, the decoder receives both the semantic representation and the acoustic representation generated at the previous time step. For this reason, autoregressive TTS achieves strong performance. Thus, we propose novel algorithms to bring similar benefits to non-autoregressive TTS. First, we propose a method to distill soft alignments—originally provided by attention in autoregressive models—into a flow matching model trained between mel-spectrograms and text representations. This allows non-autoregressive models to leverage attention-like context vectors without requiring autoregressive decoding. Second, we introduce an invertible encoder, designed based on normalizing flow, to disentangle semantic and residual acoustic representations. The invertible encoder maps the residual information, which is absent in the context vector, closer to a Gaussian distribution. During inference, we can treat the context vector as the semantic representation and Gaussian noise as the acoustic representation. Lastly, to improve zero-shot TTS performance, we propose a prompt-aware lightweight convolution, where the kernel weights are dynamically adjusted for each speech prompt. With the proposed methods, our non-autoregressive TTS model achieves comparable performance to existing autoregressive models.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24126
Loading