Pre-training Language Model for Mongolian with Agglutinative Linguistic Knowledge Injection

Published: 01 Jan 2024, Last Modified: 15 May 2025, IJCNN 2024, CC BY-SA 4.0
Abstract: The BERT-based Pre-training Language Model (PLM) has become a crucial step toward achieving the best results on various natural language processing (NLP) tasks. However, current progress focuses mainly on major languages such as English and Chinese and has not thoroughly investigated low-resource languages, particularly agglutinative languages like Mongolian, due to the scarcity of large-scale data resources and the difficulty of capturing agglutinative knowledge. In this paper, we propose a novel PLM for the Mongolian language that incorporates a three-stage agglutinative knowledge injection strategy. Specifically, early-stage injection converts each Mongolian word into fine-grained sub-word tokens comprising a stem and suffixes; middle-stage injection applies a morphological-knowledge-based masking strategy to enhance the model's ability to learn agglutinative knowledge; late-stage injection requires the model not only to restore the masked tokens but also to predict the order of the suffixes. To address the issue of data scarcity, we create a large-scale Mongolian pre-training dataset and three datasets for three downstream tasks, namely News Classification, Named Entity Recognition (NER), and Part-of-Speech (POS) prediction. The experimental results on the three downstream tasks demonstrate that our method surpasses the traditional BERT approach and successfully learns agglutinative language knowledge of Mongolian.
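The abstract does not give implementation details of the injection pipeline; the sketch below is only a rough illustration of how the early- and middle-stage steps could look: a morphological analyzer splits each word into a stem and suffixes, and masking is then applied to whole morphological units rather than to arbitrary sub-word pieces. All names here (`segment`, `morphological_mask`, the `##` suffix marker, the toy analyzer, the masking ratio) are assumptions for illustration, not the paper's actual code.

```python
# Illustrative sketch, not the paper's implementation: stem/suffix
# segmentation followed by morphology-aware masking.
import random


def segment(word, analyzer):
    """Early stage: split a word into a stem plus suffix tokens.

    `analyzer` is a hypothetical callable returning (stem, [suffixes]);
    suffixes are prefixed with '##' so the stem/suffix boundary is explicit.
    """
    stem, suffixes = analyzer(word)
    return [stem] + ["##" + s for s in suffixes]


def morphological_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Middle stage: mask whole morphological units (a stem together with
    its suffixes) instead of independent sub-word pieces."""
    # Group tokens into units: a stem followed by its '##' suffixes.
    units, current = [], []
    for tok in tokens:
        if tok.startswith("##") and current:
            current.append(tok)
        else:
            if current:
                units.append(current)
            current = [tok]
    if current:
        units.append(current)

    masked, labels = [], []
    for unit in units:
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(unit))
            labels.extend(unit)            # late stage: restore these tokens
        else:
            masked.extend(unit)
            labels.extend([None] * len(unit))
    return masked, labels


if __name__ == "__main__":
    # Toy analyzer: pretend every word is pre-split at '+' (illustration only),
    # e.g. 'nom+uud+aas' = stem 'nom' (book) + plural + ablative suffixes.
    toy_analyzer = lambda w: (w.split("+")[0], w.split("+")[1:])
    tokens = segment("nom+uud+aas", toy_analyzer)
    print(morphological_mask(tokens, mask_prob=1.0))
```

The late-stage objective described in the abstract (predicting the order of the masked suffixes in addition to restoring them) would sit on top of the `labels` produced here; it is omitted from this sketch.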