A study on landmark detection based on CTC and its application to pronunciation error detection

Chuanying Niu, Jinsong Zhang, Xuesong Yang, Yanlu Xie

Published: 01 Jan 2017, Last Modified: 31 Jul 2024, APSIPA 2017, CC BY-SA 4.0

Abstract: Acoustic features extracted in the vicinity of landmarks have proven useful for detecting mispronunciation in our recent work [1, 2]. Traditional approaches to detecting acoustic landmarks rely on annotations by linguists with prior knowledge of speech production mechanisms, which are laborious and expensive to obtain. This paper proposes a data-driven approach based on connectionist temporal classification (CTC) that detects landmarks without any human labels while maintaining performance consistent with knowledge-based models for stop burst landmarks. We designed an acoustic model that predicts phone labels using a recurrent neural network (RNN) with bidirectional long short-term memory (BLSTM) units, trained with CTC. We found that the positions of the spiky phone outputs of this model are consistent with the landmarks annotated in the TIMIT corpus. Both the data-driven and the knowledge-based landmark models are applied to detect pronunciation errors of second-language (L2) Chinese learners. Experiments show that the data-driven CTC landmark model is comparable to the knowledge-based model in pronunciation error detection, and that fusing the two models further improves performance.
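
To make the idea of "spiky phone outputs" concrete, the following is a minimal sketch of how spike positions might be read off the per-frame posteriors of a CTC-trained model. It is not the authors' procedure, which the abstract does not specify. The function name `ctc_spike_frames`, the blank index 0, the 0.5 threshold, and the 10 ms frame shift are all assumptions chosen for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank label


def ctc_spike_frames(
    posteriors: List[List[float]],
    threshold: float = 0.5,       # assumed confidence cut-off
    frame_shift_s: float = 0.01,  # assumed 10 ms frame shift
) -> List[Tuple[int, int, float]]:
    """Return (frame, label, time_s) for each run of a confident non-blank label.

    posteriors[t][k] is the model's probability of label k at frame t.
    Each run of consecutive frames with the same confident non-blank
    argmax is reduced to its single highest-probability frame.
    """
    spikes = []
    run_label = BLANK  # label of the run currently being tracked
    start = 0          # first frame of that run
    n = len(posteriors)
    for t in range(n + 1):
        if t < n:
            frame = posteriors[t]
            label = max(range(len(frame)), key=frame.__getitem__)
            # Treat low-confidence frames like blank frames.
            if label == BLANK or frame[label] < threshold:
                label = BLANK
        else:
            label = BLANK  # sentinel frame that flushes a trailing run
        if label != run_label:
            if run_label != BLANK:
                # Keep the frame where the run's label is most probable.
                peak = max(range(start, t),
                           key=lambda i: posteriors[i][run_label])
                spikes.append((peak, run_label, peak * frame_shift_s))
            run_label, start = label, t
    return spikes


# Toy posteriors over (blank, phone 1, phone 2) for six frames.
toy = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.85, 0.05],
    [0.95, 0.03, 0.02],
    [0.30, 0.10, 0.60],
    [0.90, 0.05, 0.05],
]
for frame, label, time_s in ctc_spike_frames(toy):
    print(frame, label, time_s)
```

The spike times returned this way could then be compared against reference landmark times, such as those annotated in TIMIT, within some tolerance window. The choice of tolerance is likewise left open here.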