Abstract: In this paper, we describe a speech synthesis system that can generate speech with specified emotions and background sounds, which we implemented for Track 2 of the ICAGC 2024 challenge at ISCSLP. The project combines two advanced audio generation models, GPT-SoVITS and AudioLDM 2, to generate emotional speech over a specified background. GPT-SoVITS clones the timbre and emotion of the target speaker's voice, while AudioLDM 2 generates background audio from the textual content. Emotional speech with the specified background is then obtained by combining the outputs of the two models. The official evaluation of the generated results focused on three aspects: speaker similarity, the convincingness of the match between background audio and speech, and the degree of emotional inspiration. Our method achieves a speaker similarity score of 3.33 and an emotional inspiration score of 3.33; for convincing matching, our score is 2.66. In subjective MOS listening tests, the overall score averages 3.06 with a standard deviation of 0.41. Overall, our system secured 2nd place.
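The abstract does not specify how the outputs of the two models are combined; the following is a minimal sketch of one plausible combination step, assuming a simple amplitude mix of the GPT-SoVITS speech track and the AudioLDM 2 background track. The helper name mix_speech_with_background and the background_gain parameter are illustrative assumptions, not part of the described system.

```python
# Illustrative sketch only: assumes the combination step is a plain
# amplitude mix of a generated speech waveform and a background waveform.
import numpy as np
import soundfile as sf

def mix_speech_with_background(speech_path, background_path, out_path,
                               background_gain=0.3):
    """Overlay generated speech onto a background track (hypothetical helper)."""
    speech, sr = sf.read(speech_path)
    background, bg_sr = sf.read(background_path)
    assert sr == bg_sr, "resample one track first if sample rates differ"

    # Convert stereo to mono if needed so the shapes match.
    if speech.ndim > 1:
        speech = speech.mean(axis=1)
    if background.ndim > 1:
        background = background.mean(axis=1)

    # Loop or trim the background to the length of the speech.
    if len(background) < len(speech):
        reps = int(np.ceil(len(speech) / len(background)))
        background = np.tile(background, reps)
    background = background[:len(speech)]

    # Attenuate the background so the speech stays intelligible, then mix.
    mixed = speech + background_gain * background
    mixed = np.clip(mixed, -1.0, 1.0)
    sf.write(out_path, mixed, sr)

mix_speech_with_background("speech.wav", "background.wav", "mixed.wav")
```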