The VIBVG Speech Synthesis System for Blizzard Challenge 2023

Published: 01 Jan 2023, Last Modified: 14 Apr 2025
Venue: Blizzard Challenge 2023
License: CC BY-SA 4.0
Abstract: This paper describes VIBVG, our end-to-end neural text-to-speech (TTS) synthesis system submitted to Blizzard Challenge 2023. One objective of the challenge is to synthesize natural, high-quality audio; another is to generate audio that closely resembles the speech of the target speaker. Our system is built on VITS, a multi-speaker end-to-end speech synthesis system. Unlike VITS, it uses BigVGAN instead of HiFi-GAN as the decoder to improve the quality of the synthesized speech. To improve naturalness, we compared several French grapheme-to-phoneme (g2p) methods and applied some modifications to the generated French phonemes. We present and discuss the overall system structure and the data pruning method, and describe the important parts of each task. Finally, we present and analyze the results of the listening test.