Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style Representation

Yongqi Wang; Chunlei Zhang; Hangting Chen; Zhou Zhao; Dong Yu

27 Sept 2024 (modified: 24 Nov 2024); ICLR 2025 Conference Withdrawn Submission; CC BY 4.0

Keywords: controllable text-to-speech, representation learning

TL;DR: We propose a two-stage language model (LM) for style-controllable TTS with a masked-autoencoded style representation as an intermediary.

Abstract: Controllable text-to-speech (TTS) systems aim to manipulate various stylistic attributes of generated speech. Despite considerable research in this area, existing models that use natural language prompts as an interface often lack the ability for fine-grained control and face a scarcity of high-quality data. To address these challenges, we propose a two-stage style-controllable TTS system with language models, utilizing a masked-autoencoded style representation as an intermediary. In our approach, we employ a masked autoencoder to learn a content-disentangled style feature of speech, which is then discretized using a residual vector quantizer. In the first stage, an autoregressive transformer is used for the conditional generation of these style tokens from text and control signals. In the second stage, we generate codec tokens from both text and sampled style tokens. Experiments demonstrate that training the first-stage model on extensive datasets enhances the robustness of the two-stage model in terms of quality and content accuracy. Additionally, our model achieves superior control over attributes such as pitch and emotion. By selectively combining discrete labels and speaker embeddings, we can fully control the speaker’s timbre and other stylistic information, or adjust attributes like pitch and emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.
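Illustration: the abstract says the style feature is discretized with a residual vector quantizer. The NumPy sketch below shows the general idea of residual vector quantization only and is not the authors' implementation. The function names, the number of stages, the codebook size, and the feature dimension are all assumptions, and the codebooks here are random rather than learned.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization (illustrative sketch).

    x: array of shape (N, D), the vectors to quantize.
    codebooks: list of arrays, each of shape (K, D), one per stage.
    Returns integer codes of shape (N, num_stages) and the reconstruction (N, D).
    """
    residual = x.copy()
    recon = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from each residual to each entry: (N, K)
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)   # nearest entry per vector
        chosen = cb[idx]
        recon += chosen             # accumulate the quantized approximation
        residual -= chosen          # the next stage quantizes what is left
        codes.append(idx)
    return np.stack(codes, axis=1), recon

def rvq_decode(codes, codebooks):
    """Sum the selected entry of every stage to rebuild the vectors."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

# Hypothetical sizes: 4 feature vectors of dimension 8, 3 stages of 16 entries.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
codebooks = [rng.normal(size=(16, 8)) * (0.5 ** i) for i in range(3)]
codes, recon = rvq_encode(x, codebooks)
```

Each row of `codes` holds one token index per quantizer stage. In a system like the one described, such indices would play the role of the style tokens that the first-stage model generates.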

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 9314
