Keywords: Expressive Text-to-Speech, flow-matching models, zero-shot generalization
TL;DR: We collect the first benchmark tailored for expressive zero-shot TTS and achieve state-of-the-art performance (comparable to human recordings).
Abstract: Expressive zero-shot text-to-speech (TTS) synthesis aims to synthesize high-fidelity speech that closely mimics a brief stylized recording without additional training. Despite advances in this area, several challenges persist: 1) current methods, which rely on implicit prompt engineering through in-context learning or on pre-trained speaker identification models, often struggle to fully capture the acoustic characteristics of the stylized speaker; 2) attaining high-fidelity voice cloning for a stylized speaker typically requires large amounts of speaker-specific data for fine-tuning; and 3) there is no benchmark tailored to expressive zero-shot TTS scenarios. To address these challenges, we present *Fox-TTS*, a family of large-scale models for high-quality expressive zero-shot TTS. We introduce an improved flow-matching Transformer model coupled with a novel learnable speaker encoder. Within the speaker encoder, we incorporate three key designs: temporal mean pooling, temporal data augmentation, and an information bottleneck that trades off pronunciation stability against speaker similarity in an explainable manner. Moreover, we have collected *Fox-eval*, the first multi-speaker, multi-style benchmark specially designed for expressive zero-shot scenarios. Extensive experiments show that Fox-TTS achieves on-par quality with human recordings in normal scenarios and state-of-the-art performance in expressive scenarios. Audio samples are available at https://fox-tts.github.io/.
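The abstract names three speaker-encoder ingredients: temporal mean pooling, temporal data augmentation, and an information bottleneck. Since the submission page gives no implementation details, the following is only a minimal PyTorch sketch of one plausible reading, assuming mel-spectrogram input and a variational-style bottleneck whose dimension controls the stability/similarity trade-off; every module name, dimension, and the crop-based augmentation below are illustrative assumptions, not Fox-TTS's actual design.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Hypothetical speaker encoder: frame encoder -> temporal mean pooling
    -> variational information bottleneck. All sizes are assumptions."""
    def __init__(self, n_mels=80, hidden=512, bottleneck=128):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # The bottleneck predicts a mean and log-variance; shrinking
        # `bottleneck` (or raising the KL weight) limits how much speaker
        # detail passes through, trading similarity for stability.
        self.to_mu = nn.Linear(hidden, bottleneck)
        self.to_logvar = nn.Linear(hidden, bottleneck)

    def forward(self, mels, lengths):
        # mels: (B, n_mels, T); lengths: (B,) counts of valid frames
        h = self.frame_encoder(mels)                          # (B, hidden, T)
        mask = (torch.arange(h.size(-1), device=h.device)[None, :]
                < lengths[:, None]).float()                   # (B, T)
        # Temporal mean pooling over valid frames only
        pooled = (h * mask.unsqueeze(1)).sum(-1) / lengths[:, None].clamp(min=1)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # KL term against a unit Gaussian; weighted into the training loss
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return z, kl

def temporal_crop(mels, lengths, min_frac=0.5):
    """Hypothetical temporal data augmentation: keep a random contiguous
    crop of each utterance so the embedding cannot rely on utterance length."""
    B, _, T = mels.shape
    out, new_len = torch.zeros_like(mels), lengths.clone()
    for b in range(B):
        L = int(lengths[b])
        keep = max(1, int(L * (min_frac + (1 - min_frac) * torch.rand(1).item())))
        start = torch.randint(0, L - keep + 1, (1,)).item()
        out[b, :, :keep] = mels[b, :, start:start + keep]
        new_len[b] = keep
    return out, new_len
```

Under this reading, the resulting embedding `z` would condition the flow-matching Transformer, and the KL weight gives the explainable knob the abstract alludes to: stronger compression stabilizes pronunciation at some cost in speaker similarity.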
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4333