ExpressiveSinger: Multilingual and Multi-Style Score-based Singing Voice Synthesis with Expressive Performance Control

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: Singing Voice Synthesis (SVS) has advanced significantly with deep generative models, achieving high audio quality but still struggling with musicality, mainly due to the lack of performance control over timing, dynamics, and pitch, which are essential for musical expression. Additionally, integrating data and supporting diverse languages and styles in SVS remain challenging. To tackle these issues, this paper presents ExpressiveSinger, an SVS framework that leverages a cascade of diffusion models to generate realistic singing across multiple languages, styles, and techniques from scores and lyrics. Our approach begins with consolidating, cleaning, annotating, and processing public singing datasets, developing a multilingual phoneme set, and incorporating different musical styles and techniques. We then design methods for generating expressive performance control signals, including phoneme timing, F0 curves, and amplitude envelopes, which enhance musicality and model consistency, introduce more controllability, and reduce data requirements. Finally, we generate mel-spectrograms and audio from the performance control signals with style guidance and singer timbre embeddings. Our models also enable trained singers to sing in new languages and styles. A listening test shows that our generated singing achieves both high musicality and audio quality compared with existing works and human singing. We release the data for future research. Demo: https://expressivesinger.github.io/ExpressiveSinger.
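To make the cascade described in the abstract concrete, the following is a minimal Python sketch of the inference data flow: score and lyrics to expressive control signals (phoneme timing, F0 curve, amplitude envelope), then mel-spectrogram, then waveform. All function names, shapes, and placeholder computations are illustrative assumptions rather than the authors' API; each stage merely stands in for one of the diffusion models or the vocoder described above.

from dataclasses import dataclass
import numpy as np

FRAME_RATE = 100  # assumed control/mel frame rate in frames per second

@dataclass
class ControlSignals:
    phoneme_timing: np.ndarray  # (num_notes,) durations in seconds
    f0_curve: np.ndarray        # (num_frames,) fundamental frequency in Hz
    amplitude_env: np.ndarray   # (num_frames,) loudness envelope in [0, 1]

def generate_controls(note_pitches, note_durations, style):
    """Stage 1 stand-in: expressive performance control from the score.
    A flat F0 per note plus a small vibrato and a Hann-shaped loudness
    curve replace what the paper's diffusion models would predict."""
    durations = np.asarray(note_durations, dtype=float)
    frames_per_note = (durations * FRAME_RATE).astype(int)
    hz = [440.0 * 2.0 ** ((p - 69) / 12.0) for p in note_pitches]  # MIDI -> Hz
    f0 = np.repeat(hz, frames_per_note)
    f0 = f0 * (1.0 + 0.01 * np.sin(np.linspace(0.0, 40.0 * np.pi, f0.size)))
    amp = np.hanning(f0.size)
    return ControlSignals(durations, f0, amp)

def generate_mel(controls, timbre, style):
    """Stage 2 stand-in: mel-spectrogram from controls, style guidance, and
    singer timbre embedding (timbre/style accepted but unused in this stub)."""
    n_mels = 80
    return np.outer(controls.amplitude_env, np.ones(n_mels))  # (frames, mels)

def vocode(mel):
    """Stage 3 stand-in for a neural vocoder: dummy waveform, 160-sample hop assumed."""
    return np.zeros(mel.shape[0] * 160)

# Usage: two quarter notes (MIDI 60 and 62) in an assumed "pop" style.
controls = generate_controls([60, 62], [0.5, 0.5], style="pop")
mel = generate_mel(controls, timbre=np.random.randn(256), style="pop")
audio = vocode(mel)
print(controls.f0_curve.shape, mel.shape, audio.shape)  # (100,) (100, 80) (16000,)

In the actual system, each stand-in would be a trained diffusion model (or vocoder) conditioned on the score, style, and singer timbre as the abstract describes.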
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Art and Culture, [Experience] Multimedia Applications
Relevance To Conference: This work points toward the next generation of SVS. It shows that in Singing Voice Synthesis it is time to shift our focus from building good-quality synthesizers to modeling expressive performance control, and demonstrates the difference this can make.
Supplementary Material: zip
Submission Number: 5257