Abstract: Because Code-Switching (CS) datasets are scarce, most researchers synthesize CS speech from multiple monolingual datasets. However, this approach makes it difficult to control the speaker's identity and tends to produce speech with low intelligibility. This letter proposes UnitDiff, a CS speech synthesis model based on the unit-diffusion framework. The model uses the self-supervised high-level representation "soft unit", extracted from soft HuBERT, to directly predict a clean mel-spectrogram $x_{0}$, which enhances control over speaker identity. A language tagging method is also introduced to improve speech intelligibility. Evaluation results validate the model's effectiveness in improving the intelligibility, speaker similarity, and speaker consistency of the generated CS speech.
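For readers unfamiliar with direct $x_{0}$ prediction, the generic denoising-diffusion formulation can be sketched as follows. This is the standard textbook form, not necessarily the exact objective used by UnitDiff: the noise schedule $\bar{\alpha}_{t}$, the network $f_{\theta}$, the soft-unit sequence $u$, and the speaker condition $s$ are notation assumed here for illustration only.

$$
x_{t} = \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}_{x_{0}} = \mathbb{E}_{x_{0},\, \epsilon,\, t} \left[ \left\lVert x_{0} - f_{\theta}(x_{t}, t, u, s) \right\rVert_{2}^{2} \right]
$$

Under this formulation the network regresses the clean mel-spectrogram $x_{0}$ itself rather than the added noise $\epsilon$.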