Enhancing news classification: domain-specific guided pretraining based on adaptive selective masking

Qiao Ding, Heng Ding, Jian Wang, Yantuan Xian, Nanyu Li, Tao Li, Tao Fang, Junyang Chen

Published: 07 Apr 2026, Last Modified: 27 May 2026, Knowledge-Based Systems, Everyone, CC BY-NC-ND 4.0

Abstract: With the ongoing advancement of natural language processing technology, pretrained language models (PLMs) have achieved impressive results across a range of tasks. However, in the context of news text classification, PLMs continue to face challenges such as limited task specificity and insufficient labeled data. To address these issues, this paper proposes an enhanced news classification framework that introduces an intermediate, domain-specific pretraining stage between the standard pretraining and fine-tuning phases. This stage uses a moderately sized, unsupervised news dataset to help the model acquire domain-specific knowledge. An adaptive selective masking mechanism is also employed to dynamically mask key terms, enabling the model to better capture task-relevant information. In addition, the paper presents a method for cleaning mislabeled samples and reweighting the training process, further improving classification performance. Experimental results on four benchmark datasets show that, compared to several advanced baseline models, the proposed approach improves average accuracy by 4.0% and increases training efficiency by approximately 50%. Moreover, additional experiments on other tasks show that the method achieves over 90% accuracy, demonstrating strong generalization capabilities.
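
The abstract does not specify how key terms are scored or masked, so the following is only a minimal illustrative sketch of selective masking in general, not the authors' mechanism. It assumes that term importance can be approximated with a static TF-IDF weight, and it samples mask positions in proportion to that weight instead of uniformly. The names `idf_scores`, `selective_mask`, and `mask_ratio` are hypothetical, and the adaptive (dynamic) part of the paper's method is not modeled here.

```python
import math
import random
from collections import Counter

MASK = "[MASK]"


def idf_scores(corpus):
    """Smoothed inverse document frequency per token over a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    # log(n_docs / df) >= 0 because df <= n_docs, so every score is >= 1.0
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def selective_mask(tokens, idf, mask_ratio=0.15, rng=None):
    """Mask a fixed budget of tokens, favoring high TF-IDF positions.

    Assumption: TF-IDF stands in for "key term" importance; the paper's
    actual scoring function is not given in the abstract.
    """
    rng = rng or random.Random(0)
    budget = max(1, round(mask_ratio * len(tokens)))
    tf = Counter(tokens)
    weights = [tf[t] / len(tokens) * idf.get(t, 1.0) for t in tokens]

    # Weighted sampling without replacement: draw key u ** (1 / w) per
    # position and keep the positions with the largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    chosen = {i for _, i in sorted(keys, reverse=True)[:budget]}

    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    # Labels hold the original token at masked positions and None elsewhere,
    # mirroring the usual masked-language-modeling target layout.
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    corpus = [
        "the central bank raised interest rates again".split(),
        "the team won the championship final".split(),
        "the bank reported strong quarterly earnings".split(),
    ]
    idf = idf_scores(corpus)
    masked, labels = selective_mask(corpus[0], idf, mask_ratio=0.3)
    print(masked)
    print(labels)
```

In this sketch, frequent function words such as "the" receive a low IDF and are therefore less likely to be masked than rarer, topic-bearing words. A dynamic variant could recompute the weights during training, but how the paper does so is not stated in the abstract.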