Mongolian Speech Recognition Based on Semi-supervised Learning and Syllable Subword Modeling Units

Yuan Li, Yonghe Wang, Zhenjie Gao, Feilong Bao

Published: 2025, Last Modified: 13 Mar 2026 | NLPCC (4) 2025 | CC BY-SA 4.0

Abstract: Mongolian is a low-resource language that lacks sufficient labeled data for end-to-end model training. Semi-supervised learning can significantly improve the performance of Mongolian speech recognition systems by exploiting low-cost unlabeled data. In this paper, we propose a pseudo-label-based semi-supervised speech recognition method for Mongolian that uses syllable-level subword modeling units. Speech representations are extracted with a pre-trained model and discretized to generate pseudo-labels for pre-training the automatic speech recognition (ASR) model. To enhance the speech representation, we propose an end-to-end training method based on syllable subword modeling units. Our experiments use 1,560 h of unlabeled Mongolian speech data. The results indicate that the proposed method reduces the model’s parameter count by 2.01% while significantly improving recognition performance. Specifically, under non-autoregressive decoding, relative WER reductions of 48.46% and 26.70% are achieved with 10 h and 100 h of labeled data, respectively. Under autoregressive decoding, the relative WER reductions are 52.21% and 13.38% for the same data conditions.
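
The discretization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the representations are discretized, so the nearest-centroid quantization below, the collapsing of consecutive repeated units, and all names and toy values (`centroids`, `frames`, `discretize`, `collapse_repeats`) are assumptions for illustration only, not the authors' method.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame`
    (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))


def discretize(frames: Sequence[Sequence[float]],
               centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame-level feature vector to a discrete unit ID."""
    return [nearest_centroid(f, centroids) for f in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive unit IDs into a single ID."""
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


# Hypothetical toy data: three 2-D centroids and six feature frames.
# In practice the frames would come from a pre-trained speech model and
# the centroids from clustering its features (both assumed here).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.2, 0.1], [0.9, 1.2],
          [1.1, 0.8], [0.1, 1.9], [0.0, 0.1]]

units = discretize(frames, centroids)      # frame-level unit IDs
pseudo_labels = collapse_repeats(units)    # sequence usable as a pseudo-label
print(units, pseudo_labels)
```

In a setup like this, the resulting unit sequence would stand in for a transcript when pre-training the ASR model on unlabeled audio; whether repeated units are collapsed is a design choice that the abstract leaves unspecified.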

External IDs: dblp:conf/nlpcc/LiWGB25