Diffusion Transformer for Adaptive Text-to-Speech

Haolin Chen; Philip N. Garner

Published: 15 Jun 2023, Last Modified: 16 Jun 2023 | SSW12 | Readers: Everyone

Keywords: speech synthesis, adaptive TTS, diffusion transformer, adaptive layer norm

TL;DR: We adapt the Diffusion Transformer for acoustic modeling of TTS and study its adaptability in both few-parameter and zero-shot settings.

Abstract: Given the success of diffusion in synthesizing realistic speech, we investigate how diffusion can be included in adaptive text-to-speech systems. Inspired by the adaptable layer norm modules for the Transformer, we adapt a new backbone of diffusion models, the Diffusion Transformer, for acoustic modeling. Specifically, the adaptive layer norm in the architecture is used to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. We show the new architecture to be a faster alternative to its convolutional counterpart for general text-to-speech, while demonstrating a clear advantage in naturalness and similarity over the Transformer for few-shot and few-parameter adaptation. In the zero-shot scenario, while the new backbone is a decent alternative, the main benefit of such an architecture is to enable high-quality parameter-efficient adaptation when finetuning is performed.
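
The adaptive layer norm conditioning mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a generic formulation, not the authors' implementation: the function name `adaptive_layer_norm`, the linear projections `w_scale`, `b_scale`, `w_shift`, `b_shift`, the `(1 + scale)` modulation, and the per-frame shape of the conditioning input are all assumptions made for illustration.

```python
import numpy as np


def adaptive_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift, eps=1e-5):
    """Layer norm whose scale and shift are predicted from a conditioning input.

    x:       (T, D) hidden states of the diffusion network (assumed shape)
    cond:    (T, C) conditioning vectors, e.g. text representations (assumed shape)
    w_*:     (C, D) hypothetical projection weights
    b_*:     (D,)   hypothetical projection biases
    """
    # Standard layer normalization over the feature dimension, without a
    # learned affine transform.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Scale and shift are regressed from the conditioning input rather than
    # stored as fixed parameters.
    scale = cond @ w_scale + b_scale
    shift = cond @ w_shift + b_shift

    # With zero projections this reduces to plain layer normalization.
    return x_hat * (1.0 + scale) + shift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(4, 8))   # 4 frames, 8 features
    text = rng.normal(size=(4, 3))     # 4 frames, 3 conditioning features
    w = rng.normal(size=(3, 8)) * 0.1
    b = np.zeros(8)
    print(adaptive_layer_norm(hidden, text, w, b, w, b).shape)
```

In a layer of this kind the conditioning enters only through the small scale and shift projections, which is one plausible reading of why the abstract associates the module with parameter-efficient adaptation.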
