A Study on Mispronunciation Detection Based on Fine-grained Speech Attribute

Minghao Guo, Cai Rui, Wei Wang, Binghuai Lin, Jinsong Zhang, Yanlu Xie

Published: 2019, Last Modified: 30 Sept 2024, APSIPA 2019, CC BY-SA 4.0

Abstract: Over the last decade, several studies have investigated speech attribute detection (SAD) for improving computer-assisted pronunciation training (CAPT) systems. The predefined speech attribute categories are either IPA-based or language-dependent, which makes it difficult to handle mispronunciation detection across multiple languages. In this paper, we propose a fine-grained speech attribute (FSA) modeling method, which defines types of Chinese speech attributes by combining Chinese phonetics with the International Phonetic Alphabet (IPA). To verify FSA, a large-scale Chinese corpus was used to train time-delay neural network (TDNN) based speech attribute models, which were then tested on a Russian learner data set. Experimental results showed that the accuracy across all FSAs on the Chinese test set was about 95% on average, and the diagnosis accuracy of FSA-based mispronunciation detection improved by 2.2% over that of the segment-based baseline system. In addition, because the FSA is theoretically capable of modeling language-universal speech attributes, we also tested the trained FSA-based method on a native English corpus, where it achieved an accuracy of about 50%.