Abstract: Mispronunciation detection and diagnosis (MDD) plays an important role in computer-aided language learning (CALL) systems. It remains a research challenge mainly because of the great variability of speech. Much MDD research bases its approach on models trained for automatic speech recognition (ASR). However, ASR tends to smooth over nuanced differences in speech for the sake of a higher recognition rate, while these differences can be vital for MDD. In this work, we propose a novel way to derive phoneme-level scores for MDD using large language model (LLM)-based text-to-speech (TTS) systems. Our proposed method is simple and efficient because speech generation from the synthesizer is not required. Results on the child speech dataset CMU-kids show that our method is comparable in performance to traditional HMM-GMM based systems on the MDD task.
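The abstract does not spell out how phoneme-level scores are obtained without running the synthesizer. One plausible reading, sketched below as an illustration and not as the authors' actual method, is that an autoregressive TTS language model is teacher-forced on the learner's utterance tokens, and the per-step log-probabilities of the reference tokens are averaged within each phoneme's aligned span. All function names and the toy numbers are hypothetical.

```python
import math

def softmax_logprob(logits, idx):
    # Log-probability of token `idx` under a softmax over the logit vector.
    m = max(logits)
    logz = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - logz

def phoneme_scores(step_logits, targets, segments):
    """Mean reference-token log-probability per phoneme segment.

    step_logits: one logit vector per decoding step (teacher-forced)
    targets:     reference token id expected at each step
    segments:    phoneme -> (start, end) step range, end exclusive
    Returns a dict mapping each phoneme to its average log-probability;
    low scores would flag candidate mispronunciations.
    """
    scores = {}
    for ph, (start, end) in segments.items():
        lps = [softmax_logprob(step_logits[t], targets[t])
               for t in range(start, end)]
        scores[ph] = sum(lps) / len(lps)
    return scores

# Toy example: two decoding steps, both belonging to one phoneme "AH".
step_logits = [[2.0, 0.0], [0.0, 2.0]]
targets = [0, 1]  # the reference token gets the higher logit at each step
print(phoneme_scores(step_logits, targets, {"AH": (0, 2)}))
```

Because the score only requires a forward pass of the TTS language model over existing tokens, no waveform is synthesized, which matches the efficiency claim in the abstract.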
External IDs: dblp:conf/mlsp/CaoFSS25