HMamba: Towards Multifaceted Computer-assisted Pronunciation Training Leveraging Hierarchical Selective State Space Model and Decoupled Cross-entropy Loss
Abstract: Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts. APA aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while MDD focuses instead on pinpointing the precise phonetic errors made by non-native language learners. However, a full-fledged CAPT system should integrate both features simultaneously. To address this need, we propose HMamba, a novel hierarchical selective state space method that jointly tackles the APA and MDD tasks. In addition, to enhance model performance, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored to the MDD task to facilitate better supervised label learning. A comprehensive set of experiments on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach in multi-aspect, multi-granular assessment. Furthermore, our approach yields a considerable improvement in MDD performance over a competitive baseline, achieving an F1-score of 63.32%.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: multi-task learning, self-supervised learning, optimization methods, automatic speech recognition, educational applications, speech technologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English, Mandarin
Submission Number: 1867