Track: Scientific Track
Keywords: Low-resource TTS, Dialect speech synthesis, Phoneme-based TTS
Abstract: Building text-to-speech (TTS) systems for low-resource languages such as Swiss German is challenging because paired data is limited and there is no standardized orthography. In practical Swiss settings, user input is typically written in High German, which motivates pipelines that map High German text to Swiss German speech via an intermediate representation. We compare three approaches: (i) direct synthesis from High German (DE-TTS), (ii) High German $\rightarrow$ Swiss German text translation followed by synthesis (CH-TTS), and (iii) High German $\rightarrow$ automatically derived fused phoneme conversion followed by synthesis (PH-TTS). Using the SwissDial dataset, we fine-tune two TTS backbones, SpeechT5 and Orpheus, and evaluate the resulting systems with closed-loop STT metrics (WER/SacreBLEU) and human MOS. The objective transcript-overlap metrics reliably penalize PH-TTS but fail to reflect human preference between DE-TTS and CH-TTS. MOS consistently ranks CH-TTS highest for both backbones, with Orpheus achieving near-original quality and remaining robust when the training data is halved. Notably, in the half-data setting PH-TTS comes close to DE-TTS, suggesting that phoneme intermediates may be more competitive in lower-resource regimes. Our analysis indicates that the current PH-TTS pipeline is limited by noisy phoneme supervision and representation mismatch, and we outline directions for making phoneme intermediates competitive in low-resource dialect TTS.
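As an illustration of the closed-loop WER metric mentioned in the abstract, the following is a minimal sketch of word error rate computed as word-level edit distance divided by reference length. It assumes the STT transcript of the synthesized audio is compared against a reference text with plain whitespace tokenization; the function name `word_error_rate` and the example sentences are hypothetical, and the submission's actual tooling, normalization, and choice of reference text are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: whitespace tokenization, no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: reference text vs. an STT transcript of synthesized audio.
reference_text = "das ist ein test"
stt_transcript = "das ist test"
wer = word_error_rate(reference_text, stt_transcript)
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the transcript is much longer than the reference.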
Submission Number: 15