VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Recent AIGC systems can generate digital multimedia content, such as text, images, and video, from human language instructions. However, existing methods for human instruction-to-speech generation exhibit two limitations. First, they require inputs to be split into a content prompt (transcript) and a description prompt (style and speaker), rather than directly supporting human instructions. This division is less natural in form and does not align with other AIGC models. Second, modeling speech style with an independent description prompt, without considering the transcript content, restricts fine-grained control over the speech. To address these limitations, we propose VoxInstruct, a novel unified multilingual codec language modeling framework that extends the traditional text-to-speech task into a general human instruction-to-speech task. Our approach enhances the expressiveness of human instruction-guided speech generation and aligns the speech generation paradigm with other modalities. To enable the model to automatically extract the content of the synthesized speech from raw text instructions, we introduce speech semantic tokens as an intermediate representation for instruction-to-content guidance. We also incorporate multiple Classifier-Free Guidance (CFG) strategies into our codec language model, which strengthens the adherence of the generated speech to human instructions. Furthermore, our model architecture and training strategies allow a speech prompt and a descriptive human instruction to be combined for expressive speech synthesis, which is a first-of-its-kind attempt.
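To make the CFG idea concrete, the sketch below shows how classifier-free guidance is commonly applied to next-token prediction in an autoregressive codec language model: the conditional and unconditional logits are blended with a guidance scale. This is a minimal illustration under assumed names (model, instruction_ids, null_ids, generated_ids, gamma, and a HuggingFace-style .logits output), not the paper's actual implementation or API.

```python
# Minimal sketch of classifier-free guidance (CFG) for an autoregressive
# codec language model. All names are illustrative placeholders.
import torch

@torch.no_grad()
def cfg_next_token_logits(model, instruction_ids, null_ids, generated_ids, gamma=1.5):
    """Blend conditional and unconditional logits for the next codec token."""
    # Conditional pass: the human instruction prefixes the generated codec tokens.
    cond_logits = model(torch.cat([instruction_ids, generated_ids], dim=-1)).logits[:, -1]
    # Unconditional pass: the instruction is replaced by a "null" (empty) prompt.
    uncond_logits = model(torch.cat([null_ids, generated_ids], dim=-1)).logits[:, -1]
    # CFG update: push the distribution toward tokens favored by the instruction;
    # gamma > 1 strengthens instruction following, gamma = 1 recovers the conditional model.
    return uncond_logits + gamma * (cond_logits - uncond_logits)
```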
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Our work is highly relevant to the theme of "Multimedia in the Generative AI Era," as it focuses on the relatively novel task of human instruction-to-speech generation instead of traditional text (transcript)-to-speech synthesis. For "Multimedia Foundation Models," our method uses a large language model (LLM) architecture as the backbone, and we improve how the mapping from textual instruction to speech content and style is modeled within the LLM. The model can understand natural language instructions and follow them to generate the corresponding speech. This aligns with the topic of seeking state-of-the-art techniques in multimedia alignment, architecture design, new applications, and fundamental insights. For "Generative Multimedia," our work leverages powerful codec language modeling techniques to achieve expressive speech synthesis. By combining AR and NAR language models, introducing CFG into the codec LM, and designing dedicated training strategies, our system can generate high-quality, expressive, and realistic speech that follows the instructions provided by humans. This emphasis on interactive and personalized systems allows for a better user experience, which is a key aspect of the topic.
Supplementary Material: zip
Submission Number: 5555