Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

ICLR 2025 Conference Submission 879 Authors

15 Sept 2024 (modified: 28 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Zero-Shot Speech Synthesis, Large-Scale TTS, Accented TTS
TL;DR: This paper introduces a TTS system featuring a novel sparse alignment algorithm that guides the latent diffusion transformer (DiT), combining the advantages of fully end-to-end methods and duration-based methods.
Abstract: While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) autoregressive large language models are inefficient and not robust in long-sentence inference; 2) non-autoregressive diffusion models without explicit speech-text alignment require substantial model capacity for alignment learning; 3) predefined alignment-based diffusion models suffer from the naturalness constraints of forced alignments and a complicated inference pipeline. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, 1) we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space; 2) to simplify the overall pipeline, we propose a unified frontend language model (F-LM) training framework to cover various speech processing tasks required by TTS models. Additionally, we adopt the piecewise rectified flow technique to accelerate the generation process and employ a multi-condition classifier-free guidance strategy for accent intensity adjustment. Experiments demonstrate that S-DiT matches state-of-the-art zero-shot TTS speech quality while maintaining a more efficient pipeline. Moreover, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
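The abstract does not spell out the multi-condition classifier-free guidance formulation, but a common way to combine an extra guidance term for accent-intensity control looks like the following minimal sketch. The `model` call signature, the use of `None` for a dropped (null) condition, and the guidance weights `w_txt` and `w_acc` are assumptions for illustration, not the authors' actual implementation.

```python
import torch

def multi_condition_cfg(model, x_t, t, text, spk, accent,
                        w_txt: float = 2.0, w_acc: float = 1.0):
    """Generic multi-condition classifier-free guidance sketch.

    `model(x_t, t, text, spk, accent)` is assumed to return the model's
    noise/velocity prediction for latent x_t at timestep t; passing None
    for a condition is assumed to mean that condition was dropped.
    """
    pred_uncond = model(x_t, t, None, None, None)     # fully unconditional
    pred_text   = model(x_t, t, text, spk, None)      # text + speaker, no accent
    pred_full   = model(x_t, t, text, spk, accent)    # all conditions

    # Standard CFG term for text/speaker, plus a separate accent term whose
    # weight w_acc scales accent intensity (w_acc = 0 removes the accent,
    # larger values strengthen it).
    return (pred_uncond
            + w_txt * (pred_text - pred_uncond)
            + w_acc * (pred_full - pred_text))
```

In this kind of scheme, accent intensity is adjusted at inference time purely through `w_acc`, without retraining the model.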
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 879