Abstract: Mispronunciation detection and diagnosis (MDD) plays an important role in computer-aided language learning (CALL) systems. It remains a research challenge mainly because of the great variability of speech. Much MDD research bases its approach on models trained for automatic speech recognition (ASR). However, ASR tends to smooth over nuanced differences in speech for the sake of a higher recognition rate, while these differences can be vital for MDD. In this work, we propose a novel way to derive phoneme-level scores for MDD using large language model (LLM)-based text-to-speech (TTS) systems. Our proposed method is simple and efficient because speech generation from the synthesizer is not required. Results on the child speech dataset CMU-kids show that our method is comparable in performance to traditional HMM-GMM based systems on the MDD task.
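The abstract does not spell out how phoneme-level scores are obtained without running the synthesizer. One plausible reading, sketched below as an illustration and not as the authors' actual method, is that an autoregressive TTS language model is teacher-forced on the learner's utterance tokens, and the per-step log-probabilities of the reference tokens are averaged within each phoneme's aligned span. All function names and the toy numbers are hypothetical.

```python
import math

def softmax_logprob(logits, idx):
    # Log-probability of token `idx` under a softmax over the logit vector.
    m = max(logits)
    logz = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - logz

def phoneme_scores(step_logits, targets, segments):
    """Mean reference-token log-probability per phoneme segment.

    step_logits: one logit vector per decoding step (teacher-forced)
    targets:     reference token id expected at each step
    segments:    phoneme -> (start, end) step range, end exclusive
    Returns a dict mapping each phoneme to its average log-probability;
    low scores would flag candidate mispronunciations.
    """
    scores = {}
    for ph, (start, end) in segments.items():
        lps = [softmax_logprob(step_logits[t], targets[t])
               for t in range(start, end)]
        scores[ph] = sum(lps) / len(lps)
    return scores

# Toy example: two decoding steps, both belonging to one phoneme "AH".
step_logits = [[2.0, 0.0], [0.0, 2.0]]
targets = [0, 1]  # the reference token gets the higher logit at each step
print(phoneme_scores(step_logits, targets, {"AH": (0, 2)}))
```

Because the score only requires a forward pass of the TTS language model over existing tokens, no waveform is synthesized, which matches the efficiency claim in the abstract.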
External IDs: dblp:conf/mlsp/CaoFSS25