Comprehensive Evaluation of Grammatical Error Correction Systems: Including and Beyond Reference-Based Metrics

ACL ARR 2025 May Submission2261 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: This study addresses three current limitations in Grammatical Error Correction (GEC): the absence of comprehensive evaluation of the newest Large Language Models (LLMs), the reliance on single evaluation metrics for comparative analysis, and the underestimation of system performance by reference-based metrics. We address these limitations first by fine-tuning state-of-the-art LLMs (GPT-4o, LLaMA 3.3 70B) and combining them with zero-shot DeepSeek V3 in an ensemble, which outperforms previous GEC systems on multiple reference-based metrics. We also present the first comprehensive GEC system comparison, evaluating performance across multiple sequence tagging, sequence-to-sequence, and LLM-based approaches using both reference-based and reference-free metrics. Finally, using LLM-as-a-Judge with human validation, we demonstrate that 73.76% of fine-tuned GPT-4o's corrections that did not match the gold reference are either equally valid grammatically or preferred over the gold reference, revealing that reference-based metrics significantly underestimate GEC system performance.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, grammatical error correction
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 2261