Abstract: Although existing Text-To-Speech (TTS) synthesizers generate high-quality speech in most cases, their performance still depends on the distribution of the training data. On tasks with complex data distributions, such as code-switching TTS, these models may generate speech that sounds unnatural or has low speaker similarity. In this paper, we propose CosDiff, a code-switching TTS model based on a multi-task Denoising Diffusion Implicit Model (DDIM) that integrates Voice Conversion (VC) and TTS functionality. Using the VC function, we construct a single-speaker bilingual dataset for training, which yields better code-switching synthesis than the speaker-encoder method trained on multiple single-speaker monolingual datasets. In addition, we directly predict the clean data x0 and apply progressive diffusion distillation, both of which further accelerate the model's sampling process. Experimental results demonstrate the efficacy of this method in improving generation quality, increasing sampling speed, and distilling the model.
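As a rough illustration of what "directly predicting the clean data x0" means for sampling, the sketch below shows one deterministic DDIM update (eta = 0) written in terms of a predicted clean sample. It is a generic textbook form of the update, not CosDiff's actual implementation. The function name `ddim_step_x0`, its arguments, and the use of plain Python lists are assumptions made for illustration.

```python
import math


def ddim_step_x0(x_t, x0_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0) from a predicted clean sample.

    x_t            : current noisy sample (list of floats)
    x0_hat         : the network's prediction of the clean data x0 (hypothetical input)
    alpha_bar_t    : cumulative signal level at the current step, must be < 1
    alpha_bar_prev : cumulative signal level at the target (less noisy) step
    """
    a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, s_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    out = []
    for xt, x0 in zip(x_t, x0_hat):
        # Noise implied by the current sample and the predicted clean data.
        eps_hat = (xt - a_t * x0) / s_t
        # Re-noise the prediction to the less noisy target level.
        out.append(a_p * x0 + s_p * eps_hat)
    return out
```

Because the update is deterministic, a student model can in principle be trained to reproduce two such teacher steps in one. That is the general idea behind progressive distillation, although the abstract does not give the details of the procedure used here.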