Changanya Ndimi: Code-Switched Speech Generation via a Diffusion Prior and Linguistic Constraints

ICLR 2026 Conference Submission 19549 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Code-switching, speech, low-resource, Diffusion
Abstract: We consider the problem of generating code-switched speech by editing monolingual utterances using a pre-trained generative prior and a set of linguistic constraints. In our setting, the prior is a diffusion-based speech synthesis model trained independently on monolingual data, while the constraints are differentiable functions that guide the insertion of foreign-language segments. These constraints include a multilingual language classifier and a contrastively trained segment encoder, which together ensure that inserted content is linguistically plausible, semantically coherent, and sociolinguistically grounded. By iteratively modifying the noisy speech representation and conditioning on segment-level constraints, our model performs targeted segment replacements without requiring parallel code-switched data. We evaluate our system on a semantically aligned speech corpus spanning five African languages from three major phyla. The resulting speech achieves a COMET score of 0.815 and a LaBSE similarity of 0.880 at the segment level, while preserving speaker identity with an Equal Error Rate of 6.7%. It also reproduces natural code-switching patterns in frequency, position, and alternation rate without explicit supervision. To our knowledge, this is the first approach that enables controlled multi-language infusion in a single utterance, producing fluent, coherent, and sociolinguistically realistic code-switched speech. Our work demonstrates that guided diffusion is a promising plug-and-play mechanism for cross-lingual and low-resource speech generation tasks.
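The abstract describes steering a pre-trained diffusion prior with differentiable constraint losses at sampling time. The page gives no code, so the following is only a minimal sketch of what such constraint-guided sampling could look like under standard classifier-guidance assumptions: a DDPM-style reverse loop whose update is nudged by gradients of a language-identity loss and a semantic-coherence loss. All module names (SpeechDenoiser, LangClassifier stand-ins), dimensions, and guidance weights below are illustrative placeholders, not details from the paper.

```python
# Hedged sketch: constraint-guided DDPM sampling over a speech-segment latent.
# The denoiser, language classifier, and segment encoder are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 50          # number of diffusion steps (toy value)
D = 128         # latent dimension of a speech segment (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class SpeechDenoiser(nn.Module):
    """Stand-in for the pre-trained monolingual diffusion prior."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D + 1, 256), nn.SiLU(), nn.Linear(256, D))
    def forward(self, x_t, t):
        t_feat = torch.full((x_t.size(0), 1), float(t) / T)
        return self.net(torch.cat([x_t, t_feat], dim=-1))  # predicts noise eps

denoiser = SpeechDenoiser()
lang_clf = nn.Linear(D, 5)    # stand-in multilingual language classifier (5 languages)
seg_enc = nn.Linear(D, 64)    # stand-in contrastively trained segment encoder

def guided_sample(target_lang, ref_embedding, w_lang=1.0, w_sem=1.0):
    """Sample a replacement segment while descending the constraint losses."""
    x_t = torch.randn(1, D)
    for t in reversed(range(T)):
        x_t = x_t.detach().requires_grad_(True)
        eps = denoiser(x_t, t)
        # Estimate of the clean segment (x0-hat) from the noisy latent.
        x0_hat = (x_t - torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alpha_bars[t])
        # Differentiable constraints: target-language identity + semantic coherence
        # with a reference embedding of the segment being replaced.
        loss_lang = F.cross_entropy(lang_clf(x0_hat), torch.tensor([target_lang]))
        loss_sem = 1 - F.cosine_similarity(seg_enc(x0_hat), ref_embedding).mean()
        grad = torch.autograd.grad(w_lang * loss_lang + w_sem * loss_sem, x_t)[0]
        # Standard DDPM posterior-mean update, nudged against the constraint gradient.
        mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        mean = mean - grad
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = (mean + torch.sqrt(betas[t]) * noise).detach()
    return x_t

segment = guided_sample(target_lang=2, ref_embedding=torch.randn(1, 64))
```

This mirrors the "plug-and-play" framing: the prior is frozen and only the sampling trajectory is modified, so no parallel code-switched data is needed. How the paper actually scales the guidance term or selects which segments to replace is not specified on this page.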
Primary Area: generative models
Submission Number: 19549