Structural-Semantic Constraints for Enhanced Chinese Language Modeling

Published: 15 Nov 2025, Last Modified: 08 Mar 2026, Venue: AAAI 2026 Bridge LMReasoning, License: CC BY 4.0
Keywords: Chinese pre-training, language model, natural language processing, representation learning
TL;DR: We propose a general Chinese pre-training optimization scheme that incorporates both word-level and structure-level semantics of Chinese characters into the language model, alleviating imbalanced word-level pre-training.
Abstract: Most Chinese pre-training studies follow the word-level strategies of English pre-training. These studies do not consider the exposure imbalance of Chinese characters, which results in imbalanced performance across downstream tasks. To address this issue, we propose structure semantic constraints on Chinese characters for enhanced language modeling. Since a Chinese character is composed of several structure units (components, strokes, and composite types), the structure semantic constraints explore Chinese structure-level semantics by deconstructing characters into structure units and reconstructing characters from those units. In contrast to masked language modeling (MLM), which learns the global semantics of characters, this task focuses on their local semantic representations. Representation tasks at different levels help the model perform well on fine-grained Chinese representation, alleviating imbalanced word-level pre-training through balanced structure-level pre-training. In our experiments, we implement the structure semantic constraints on the BERT architecture. The proposed model achieves an overall performance improvement on multiple Chinese NLP tasks. Experimental results and analysis demonstrate the effectiveness of the proposed scheme for Chinese pre-training.
Submission Number: 4
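
The deconstruction and reconstruction between characters and structure units described in the abstract can be pictured with a small lookup-based sketch. This is only an illustration under stated assumptions, not the authors' method: the table `TOY_TABLE`, the `Structure` type, and the functions `deconstruct` and `reconstruct` are hypothetical names, the three entries are hand-written examples, and strokes are omitted for brevity. In the proposed scheme these mappings would presumably be learned as pre-training objectives rather than looked up.

```python
from typing import Dict, NamedTuple, Tuple


class Structure(NamedTuple):
    """Structure units of one character (hypothetical toy representation)."""
    composite_type: str           # how the components are arranged
    components: Tuple[str, ...]   # component characters, in reading order


# Hypothetical hand-written table; a real system would need a full
# character decomposition resource, which the abstract does not name.
TOY_TABLE: Dict[str, Structure] = {
    "好": Structure("left-right", ("女", "子")),
    "明": Structure("left-right", ("日", "月")),
    "男": Structure("top-bottom", ("田", "力")),
}

# Inverse mapping used for reconstruction.
INVERSE_TABLE: Dict[Structure, str] = {s: c for c, s in TOY_TABLE.items()}


def deconstruct(char: str) -> Structure:
    """Map a character to its structure units (character -> units)."""
    return TOY_TABLE[char]


def reconstruct(structure: Structure) -> str:
    """Map structure units back to the character (units -> character)."""
    return INVERSE_TABLE[structure]


if __name__ == "__main__":
    for ch in TOY_TABLE:
        units = deconstruct(ch)
        # Round trip: deconstructing then reconstructing recovers the character.
        print(ch, units.composite_type, units.components, reconstruct(units))
```

The round trip makes the "local" nature of the task concrete: each character is tied to its own components and composite type, independent of the sentence context that MLM relies on.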