Driving Chinese Spelling Correction from a Fine-Grained Perspective

ACL ARR 2024 June Submission3095 Authors

15 Jun 2024 (modified: 02 Jul 2024) | ACL ARR 2024 June Submission | CC BY 4.0
Abstract: This paper examines Chinese spelling correction (CSC) from a fine-grained perspective, motivated by the observation that existing evaluations lack a nuanced typology of spelling errors. This deficiency can give a misleading impression of model performance, creating an “invisible” bottleneck that hinders progress in CSC research. We first categorize spelling errors into six types and conduct a fine-grained evaluation across a wide variety of models, including BERT-based models and LLMs. This evaluation pinpoints two underlying weaknesses of existing state-of-the-art models: utilizing contextual clues and handling the co-existence of multiple typos, which correspond to contextual errors and multi-typo errors, respectively. These error types, however, occur rarely in conventional training corpora. We therefore introduce new error generation methods to synthesize them, and the resulting augmented data can be leveraged to enhance the training of CSC models. We hope this work provides fresh insight for future CSC research.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Chinese spelling correction
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 3095