Midi-Voice: Expressive Zero-Shot Singing Voice Synthesis via Midi-Driven Priors

Published: 01 Jan 2024, Last Modified: 14 May 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Recently, singing voice synthesis (SVS) models have made significant progress with generative models. However, previous SVS models inaccurately predict the prior and fundamental frequency (F0) for unseen speakers, resulting in low-quality generated singing voices. To address these issues, in this paper we propose MIDI-Voice for expressive singing voice synthesis and robust zero-shot singing voice style transfer. We feed a MIDI-based prior to a score-based diffusion model for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score; because it contains only the note information and no speaker information, it enables high-quality singing voice style adaptation. We also propose a DDSP-based MIDI-style prior for synthesizing a more expressive singing voice and for singing style adaptation, although it requires additional information from the audio. The experimental results show that MIDI-Voice outperforms previous models in synthesizing an expressive singing voice and demonstrates superior zero-shot singing voice style transfer performance.
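To make the idea of a speaker-independent, MIDI-driven prior concrete, the sketch below shows one plausible way to expand MIDI note events into a frame-level, spectrogram-like prior that carries only note (pitch and timing) information. This is an illustrative assumption, not the paper's implementation; all function names, frame rates, and the Gaussian-bump construction are hypothetical, and the resulting prior would simply serve as the conditioning input to a score-based diffusion model.

```python
# Minimal sketch (assumption, not the paper's method): build a frame-level
# prior from MIDI note events. The prior encodes only note pitch/timing,
# with no speaker identity, matching the abstract's motivation.
import numpy as np

FRAME_RATE = 100  # frames per second (assumed 10 ms hop size)


def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def midi_to_frame_pitch(notes, total_sec):
    """Expand (onset_sec, dur_sec, midi_pitch) note events into a
    frame-level MIDI pitch sequence; 0 marks silent frames."""
    n_frames = int(round(total_sec * FRAME_RATE))
    pitch = np.zeros(n_frames, dtype=np.float32)
    for onset, dur, midi_pitch in notes:
        start = int(round(onset * FRAME_RATE))
        end = min(int(round((onset + dur) * FRAME_RATE)), n_frames)
        pitch[start:end] = midi_pitch
    return pitch


def midi_pitch_to_f0(pitch):
    """Convert MIDI pitch numbers to F0 in Hz (A4 = 440 Hz),
    keeping silent frames at 0."""
    f0 = np.where(pitch > 0, 440.0 * 2.0 ** ((pitch - 69.0) / 12.0), 0.0)
    return f0.astype(np.float32)


def midi_driven_prior(notes, total_sec, n_mels=80, fmin=40.0, fmax=8000.0):
    """Place a soft Gaussian bump on the mel bin nearest each frame's note
    F0, giving a coarse spectrogram-like prior (illustrative choice)."""
    f0 = midi_pitch_to_f0(midi_to_frame_pitch(notes, total_sec))
    mel_edges = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels)
    prior = np.zeros((len(f0), n_mels), dtype=np.float32)
    voiced = f0 > 0
    # Fractional mel-bin index of each voiced frame's note F0.
    centers = np.interp(hz_to_mel(f0[voiced]), mel_edges, np.arange(n_mels))
    bins = np.arange(n_mels)[None, :]
    prior[voiced] = np.exp(-0.5 * ((bins - centers[:, None]) / 2.0) ** 2)
    return prior  # shape: (n_frames, n_mels)


if __name__ == "__main__":
    # Two notes: C4 for 0.5 s, then E4 for 0.5 s.
    notes = [(0.0, 0.5, 60), (0.5, 0.5, 64)]
    prior = midi_driven_prior(notes, total_sec=1.0)
    print(prior.shape)  # (100, 80); this would condition the diffusion model
```

Because the prior is derived purely from the musical score, it contains no information about the training singers, which is the property the abstract credits for robust zero-shot style adaptation; the DDSP-based MIDI-style prior would additionally draw expressive cues from reference audio.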