End-to-End Thai Text-to-Speech with Linguistic Unit

Kontawat Wisetpaitoon, Sattaya Singkul, Theerat Sakdejayont, Tawunrat Chalothorn

Published: 2024, Last Modified: 19 Feb 2025. ICMR 2024. CC BY-SA 4.0

Abstract: In this study, we explore the influence of Thai Linguistic Units (TH-LUs) and speech trimming on state-of-the-art Thai Text-to-Speech (TTS) systems. We propose an end-to-end Thai TTS framework that emphasizes phonemes, syllables, and words, which are essential for accurate text pronunciation. To investigate these aspects, we designed two main experiments: the TH-LU factor experiment and the TH-LU with speech trimming factor experiment. Our assessment targeted speaker tone and pronunciation accuracy. The VITS model was the standout performer in tonal accuracy, evaluated with the Speaker Encoder Cosine Similarity (SECS) method, across different TH-LUs on both trimmed and untrimmed speech training data. For pronunciation accuracy, we used a Thai speech-to-text model as the evaluator. Our results indicate that VITS with the word linguistic unit outperforms all baselines in overall performance, excelling in both speaker tone and pronunciation accuracy. This research contributes to TTS, particularly for the Thai language, by highlighting the importance of diverse TH-LUs and speech trimming in TTS model development and underlining the need for evaluation methods that account for both tonal accuracy and pronunciation quality.
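
The two evaluation axes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The first function computes the cosine similarity that SECS is built on, assuming speaker embeddings have already been extracted by some speaker encoder (the abstract does not say which one). The second computes a character error rate between a reference text and a speech-to-text transcript; the abstract does not state which pronunciation metric was used, so character error rate is an assumption here, and the function names `cosine_similarity` and `char_error_rate` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    In a SECS-style evaluation, `a` would be the speaker embedding of the
    reference recording and `b` that of the synthesized speech (assumed to
    come from an external speaker encoder, not shown here).
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by reference length.

    Assumption: pronunciation accuracy is scored by comparing the input text
    with a speech-to-text transcript of the synthesized audio. Characters are
    Unicode code points, so Thai vowel and tone marks count individually.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

Under these assumptions, a higher cosine similarity indicates that the synthesized voice is closer to the reference speaker, while a lower error rate indicates more accurate pronunciation, so the two scores would need to be reported separately rather than averaged directly.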