Abstract: In this study, we introduce ExpressiveSinger, an end-to-end expressive singing voice synthesis model that accurately reflects users' musical expression by analyzing real-played MIDI sequences and lyrics. We propose a novel method to automatically annotate velocity labels for the MIDI sequences in SVS datasets, which, unlike real-played MIDI sequences, do not inherently contain velocity information. Moreover, we separately model expressive features and modify the vocoder to enhance the controllability and quality of the synthesized singing voices. Finally, we adopt a soft-vc-like approach for end-to-end training to effectively preserve more linguistic content features. Our experiments on a professional Mandarin singing corpus validate our data annotation method and demonstrate the effectiveness of ExpressiveSinger in terms of naturalness and a strong correlation between the synthesized singing voice and the MIDI input.