Cross-Dialect Text-to-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT

Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari

Published: 2024, Last Modified: 18 Mar 2026, Venue: SLT 2024, License: CC BY-SA 4.0

Abstract: We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers’ voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.

External IDs: dblp:conf/slt/YamauchiSS24