Visible to everyone since 02 Jul 2024. License: CC BY 4.0.
Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate tasks. APA aims to provide scores for multiple pronunciation aspects across diverse linguistic levels, while MDD focuses instead on pinpointing the precise phonetic errors made by non-native language learners. However, a full-fledged CAPT system should support both functions simultaneously. To address this need, we propose HMamba, a novel hierarchical selective state space method that jointly tackles the APA and MDD tasks. In addition, to enhance model performance, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored to the MDD task to facilitate better supervised label learning. Comprehensive experiments on the speechocean762 benchmark dataset demonstrate the effectiveness of our approach in multi-aspect, multi-granular assessment. Furthermore, our approach yields a considerable improvement in MDD performance over a competitive baseline, achieving an F1-score of 63.32%.