Abstract: Existing work on Chinese text error detection focuses mainly on spelling errors and simple grammatical errors. These errors have been studied extensively and are relatively easy for humans to spot. Chinese Semantic Error Recognition (CSER), by contrast, targets more complex semantic errors that even humans cannot easily recognize. Because the syntactic relations between words are complex, we find that the syntactic structure given by the syntax tree can help identify semantic errors. In this paper, we adopt pre-trained models for the CSER task. To make the model learn syntactic structure during the pre-training stage, we design a novel pre-training task that predicts the syntactic structure between different words, as given by the syntax tree. Since no published dataset exists for CSER, we build the first high-quality dataset for the task, the Corpus of Chinese Linguistic Semantic Acceptability (CoCLSA), which is extracted from high school examinations. Experimental results on CoCLSA show that our model, pre-trained with the new task, performs favorably compared with existing pre-trained models.