Learning economically for Chinese word segmentation: tuning pretrained model via active learning and N-gram preference
Abstract: In Chinese language processing, Chinese Word Segmentation (CWS) is a crucial step for downstream tasks such as translation, information retrieval, and text generation, which in turn support applications that interact with humans in complex real-world settings. Recently, with the advancement of pretrained language models, CWS performance has reached a historic peak. In these studies, incorporating n-gram features into contextual representations further enhances CWS performance. However, such improvements come at the cost of laborious and expensive annotation, which limits their application in scenarios with scarce labeled data. Balancing performance and cost therefore remains a challenge. To address this issue, we propose an active learning framework based on a pretrained language model that improves overall performance by enhancing the character-level encoder with n-gram preference. Meanwhile, an active learning process based on loss prediction is included to accelerate training while minimizing the amount of labeled data. Experimental results on ten benchmark datasets demonstrate improved performance with limited labeled data, providing a more economical approach to tuning a CWS model for a specific domain.
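As a rough illustration of the loss-prediction-based selection step mentioned in the abstract, the sketch below ranks unlabeled sentences by a predicted segmentation loss and queries the highest-scoring ones for annotation. The encoder interface, the auxiliary loss-prediction head, and all names here are assumptions for illustration only, not the paper's released implementation.

# Hypothetical sketch of loss-prediction-based sample selection for active
# learning; module names, tensor shapes, and the data-loader format are
# assumptions, not the authors' code.
import torch
import torch.nn as nn

class LossPredictionHead(nn.Module):
    """Predicts the segmentation loss of a sentence from pooled encoder features."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, sentence_repr: torch.Tensor) -> torch.Tensor:
        # sentence_repr: (batch, hidden_size) pooled character-level features
        return self.scorer(sentence_repr).squeeze(-1)  # (batch,) predicted losses

def select_for_annotation(encoder, loss_head, unlabeled_loader, budget: int):
    """Rank unlabeled sentences by predicted loss; return ids of the top `budget`."""
    scores, ids = [], []
    encoder.eval()
    loss_head.eval()
    with torch.no_grad():
        for batch_ids, batch_inputs in unlabeled_loader:  # assumed (ids, tensors) pairs
            features = encoder(batch_inputs)          # (batch, hidden_size)
            predicted_loss = loss_head(features)      # higher = presumed more informative
            scores.append(predicted_loss)
            ids.extend(batch_ids)
    scores = torch.cat(scores)
    top = torch.topk(scores, k=min(budget, scores.numel())).indices.tolist()
    return [ids[i] for i in top]

In this reading, the selected sentences would be sent to annotators, added to the labeled pool, and the segmenter and loss-prediction head retrained, repeating until the annotation budget is exhausted.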