Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation

ACL ARR 2024 December Submission 1598 Authors

16 Dec 2024 (modified: 05 Feb 2025) · License: CC BY 4.0
Abstract: Controllable text-to-speech (TTS) systems aim to manipulate various stylistic attributes of generated speech. Existing models that use natural language prompts as an interface often lack fine-grained control and face a scarcity of high-quality training data. To address these challenges, we propose a two-stage, language-model-based style-controllable TTS system that uses a masked-autoencoded representation as an intermediary. We employ a masked autoencoder to learn a speech representation rich in stylistic information, which is then discretized with a residual vector quantizer. In the first stage, an autoregressive transformer conditionally generates these style-rich tokens from text and control signals. In the second stage, we generate codec tokens from both the text and the sampled style-rich tokens. Experiments demonstrate that training the first-stage model on extensive datasets improves the robustness of the two-stage model in terms of quality and content accuracy. Our model also achieves superior control over attributes such as pitch and emotion. By selectively combining discrete labels and speaker embeddings, we can either fully control the speaker's timbre and other stylistic information or adjust attributes such as emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.
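To make the intermediary representation concrete, below is a minimal PyTorch sketch of the residual-vector-quantization step mentioned in the abstract. The class name `ResidualVectorQuantizer`, the feature dimensions, and the number of codebooks are illustrative assumptions rather than the authors' implementation, and training machinery (commitment loss, straight-through gradients, codebook updates) is omitted.

```python
import torch
import torch.nn as nn

class ResidualVectorQuantizer(nn.Module):
    """Discretizes continuous style features with a stack of residual codebooks.

    Inference-only sketch: codebook learning (e.g. EMA updates, commitment
    loss, straight-through gradients) is omitted.
    """

    def __init__(self, dim: int = 64, num_quantizers: int = 4, codebook_size: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) continuous features, e.g. outputs of the
        # masked autoencoder's style encoder.
        residual, codes = x, []
        for cb in self.codebooks:
            flat = residual.reshape(-1, residual.size(-1))     # (B*T, dim)
            idx = torch.cdist(flat, cb.weight).argmin(dim=-1)  # nearest code per frame
            idx = idx.view(residual.shape[:-1])                # (B, T)
            residual = residual - cb(idx)                      # quantize-and-subtract
            codes.append(idx)
        return torch.stack(codes, dim=-1)                      # (B, T, num_quantizers)

# Toy usage: turn a batch of continuous style features into discrete token
# stacks, the kind of sequence the first-stage transformer would predict.
rvq = ResidualVectorQuantizer()
style_features = torch.randn(2, 100, 64)  # stand-in for real MAE features
style_tokens = rvq(style_features)
print(style_tokens.shape)                 # torch.Size([2, 100, 4])
```

In the full system described in the abstract, the first-stage language model would generate such token stacks autoregressively from text and control signals, and the second-stage model would map them, together with the text, to codec tokens for waveform synthesis; only the quantization interface is sketched here.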
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies, text-to-speech
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1598