Decoupling-Enhanced Vietnamese Speech Recognition Accent Adaptation Supervised by Prosodic Domain Information

Yanwen Fang, Wenjun Wang, Ling Dong, Shengxiang Gao, Hua Lai, Zhengtao Yu

Published: 2024, Last Modified: 14 Jan 2026, IJCNN 2024, CC BY-SA 4.0

Abstract: The northern and southern accents of Vietnamese differ in pronunciation, particularly in tone and rhythm. Existing Vietnamese pre-trained speech models are biased in how they represent these accents, so Vietnamese speech recognition models generalize poorly to southern accents and show a noticeable drop in recognition accuracy. In this paper, we propose a decoupling-enhanced adaptive representation strategy guided by prosodic information and accent domain labels. Through Domain Adversarial Training (DAT), the pre-trained speech model decouples domain-invariant content features. At the same time, incorporating prosodic features enhances the pronunciation information of Vietnamese, enabling adaptive representation of the distinctive characteristics of the northern and southern accents. Evaluated on a Vietnamese accent dataset, the proposed method outperforms existing approaches and mitigates the performance degradation that accent differences cause in Vietnamese speech recognition models.
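
For orientation, domain adversarial training is commonly written as a saddle-point objective. The sketch below is that generic textbook formulation, not necessarily the exact loss used in this paper. All symbols are illustrative assumptions: \theta_f for the pre-trained speech encoder, \theta_y for the recognition head, \theta_d for an accent-domain classifier, and \lambda for a trade-off weight.

```latex
% Generic DAT objective; the notation is assumed, not taken from the paper.
\begin{align}
E(\theta_f, \theta_y, \theta_d)
  &= \mathcal{L}_{\mathrm{asr}}(\theta_f, \theta_y)
   - \lambda \, \mathcal{L}_{\mathrm{dom}}(\theta_f, \theta_d) \\
(\hat{\theta}_f, \hat{\theta}_y)
  &= \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \\
\hat{\theta}_d
  &= \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
\end{align}
```

Maximizing E over \theta_d is the same as minimizing the domain loss, so the classifier learns to distinguish northern from southern speech, while the encoder is pushed toward features from which the accent cannot be predicted. In practice this is often implemented with a gradient reversal layer that scales the domain-loss gradient by -\lambda before it reaches the encoder. Whether the paper uses that mechanism is not stated in the abstract.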

External IDs: dblp:conf/ijcnn/FangWDGLY24