Abstract: Grammatical Error Correction (GEC) is a crucial technique in Automated Essay Scoring (AES) for evaluating the fluency of essays. However, existing Chinese GEC datasets often fail to consider the importance of specific grammatical error types in essay-writing scenarios, rarely include data collected from native Chinese speakers, and largely ignore cross-sentence grammatical errors. Furthermore, the overall fluency of an essay is seldom measured. To address these issues, we present CEFA (Chinese Essay Fluency Assessment), an extensive corpus derived from essays written by native Chinese-speaking primary and secondary school students. CEFA provides essay fluency scores together with both coarse-grained and fine-grained grammatical error types and corrections. Experiments with various benchmark models on CEFA show that the dataset is challenging. Our findings further highlight the significance of fine-grained annotations in fluency assessment and the mutually beneficial relationship between error types and corrections. We will make the corpus and related code available for research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; language resources; NLP datasets; educational applications; GEC; essay scoring
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 4578