Abstract: Chinese Spelling Correction (CSC) aims to detect spelling errors in a sentence and correct them. Researchers have recently begun to investigate the use of pre-trained language models for this task, replacing tokens with similar characters drawn from a confusion set, rather than masking them, during pre-training. A significant limitation of this method is that pre-training on single tokens cannot effectively simulate natural errors, such as misspellings that span a full word (full spelling). Furthermore, the confusion set consists of single characters only and thus lacks the contextual relationships between characters, which lowers the quality of the generated pseudo-labeled samples. To this end, we contribute a novel approach to constructing a confusion corpus that automatically generates high-quality spelling errors. Our confusion set is built from user behavior patterns, aiming to identify characters that are easily confused in similar contexts. We then introduce a Span Confusion Pre-training (SCPre) strategy for CSC, which replaces characters, words, and phrases in the text according to a large confusion set. On the constructed corpus, different models are trained and evaluated for CSC against real-world datasets. Quantitative experiments on these datasets show that our approach empirically achieves state-of-the-art performance.
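The span confusion strategy described above can be illustrated with a minimal sketch. The confusion set entries, the example sentences, and the helper `span_confuse` are all hypothetical stand-ins (the paper's actual set is mined from user behavior patterns); the sketch only shows the core idea of corrupting character-, word-, and phrase-level spans to produce pseudo-labeled training pairs:

```python
import random

# Hypothetical span-level confusion set: keys are spans (a character,
# word, or phrase) and values are confusable replacements. The real
# set would be mined from user-behavior logs, as the paper proposes.
CONFUSION_SET = {
    "公共汽车": ["公共气车"],  # phrase-level confusion
    "天气": ["天汽"],          # word-level confusion
    "气": ["汽"],              # character-level confusion
}

def span_confuse(text, confusion_set, p=0.2, rng=None):
    """Scan the text, longest span first, and replace each matched span
    with a confusable variant with probability p. The (corrupted,
    original) pair serves as a pseudo-labeled CSC training sample."""
    rng = rng or random.Random(0)
    spans = sorted(confusion_set, key=len, reverse=True)  # prefer phrases over chars
    out, i = [], 0
    while i < len(text):
        for s in spans:
            if text.startswith(s, i) and rng.random() < p:
                out.append(rng.choice(confusion_set[s]))
                i += len(s)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With `p=1.0` every matched span is corrupted, e.g. `span_confuse("今天天气很好", CONFUSION_SET, p=1.0)` yields `"今天天汽很好"`; in practice a small replacement probability keeps most of the text clean, mirroring the sparsity of natural typos.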