VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning For Voice Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024MM2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Voice large language models (LLMs) cast voice synthesis as a language modeling task in a discrete space, and have demonstrated significant progress to date. Despite the recent success, the current development of voice LLMs in low-resource applications is hampered by data scarcity and high computational cost. In this work, we propose VoiceTuner, with a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. Specifically, 1) to mitigate data scarcity, we leverage large-scale unlabeled dataset and pre-train VoiceTuner-SSL without pre-defined applications, which can be fine-tuned in downstream tasks; 2) to further reduce the high training cost in complete fine-tuning, we introduce a multiscale transformer adapter to effectively update only around 1% parameters as a plug-and-play module. Experimental results demonstrate that VoiceTuner-SSL presents strong acoustic continuations, and VoiceTuner achieves state-of-the-art results in rich-resource TTS evaluation compared with competitive baseline models. Low-resource (1h, 10h, 30h) downstream applications including zero-shot TTS, instruction TTS, and singing voice synthesis present VoiceTuner's superior audio quality and style similarity with reduced data requirement and computational cost. Audio samples are available at https://VoiceTuner.github.io
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: In this work, we propose VoiceTuner with a pre-training and efficient fine-tuning approach for low-resource multimodal voice generation (with text, and audio modality). VoiceTuner exhibited superior quality and style similarity with reduced data requirement and computational cost in three multimodal voice generation tasks, including zero-shot TTS, instruction TTS, and singing voice synthesis.
Supplementary Material: zip
Submission Number: 5666
Loading