Abstract: There has been growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in text-to-speech (TTS) systems through joint training or by replacing acoustic features with a latent representation, adopting these approaches for SVS is challenging, and the benefits of joint training for SVS systems have not been thoroughly explored. In this paper, we conduct a systematic investigation into how to enhance the joint training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms baseline methods, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.