Exploring Human-judged and Automatically-induced Correction Difficulty for Grammatical Error Correction

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: While grammatical error correction (GEC) has improved in correction performance, evaluation remains one of the key challenges in GEC research. Specifically, conventional performance measures treat all errors equally despite the fact that some errors are more difficult to correct than others. Ideally, difficult errors should be weighted more heavily than easy ones in evaluation. This leads to the following ultimate research question: can even human experts estimate correction difficulty well? In this paper, we explore questions about correction difficulty centering on this research question. To this end, we first introduce a method for estimating agreement rates in correction difficulty judgements based on pairwise comparison. Annotating 2,025 instances with this method, we show that human experts exhibit a moderate agreement rate of 66.39\% (Cohen's $\kappa$: 0.42) in judging correction difficulty. We also show that the agreement between this human-based difficulty and an automatically induced difficulty is comparable (64.50\% and $\kappa=0.35$ on average). We further examine the annotation results to gain insights into human-judged and machine-judged correction difficulty, reporting the following three findings: (i) where the human-judged and machine-judged difficulties are strong and weak; (ii) building on (i), correction difficulty can depend on the GEC algorithm and training corpus; (iii) human-judged and machine-judged correction difficulties complement each other.
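The abstract reports a raw agreement rate alongside Cohen's $\kappa$ over pairwise difficulty judgements. The sketch below (not the authors' code; the toy labels and function name are hypothetical) illustrates how those two statistics relate for a set of "which of these two errors is harder to correct?" judgements from two annotators.

```python
# Minimal sketch, assuming each pairwise judgement is a label "A" or "B"
# indicating which error in the pair is harder to correct.
from collections import Counter

def agreement_and_kappa(labels_1, labels_2):
    """Return (raw agreement rate, Cohen's kappa) for two label sequences."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)

    # Observed agreement: fraction of pairs both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Expected chance agreement, from each annotator's label marginals.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))

    # Cohen's kappa corrects observed agreement for chance agreement.
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Hypothetical toy data: 8 pairwise "which is harder?" judgements.
human = ["A", "B", "A", "A", "B", "B", "A", "B"]
machine = ["A", "B", "B", "A", "B", "A", "A", "B"]

p_o, kappa = agreement_and_kappa(human, machine)
print(f"agreement = {p_o:.2%}, kappa = {kappa:.3f}")
```

Note that $\kappa$ discounts agreement expected by chance, which is why a 66.39% raw agreement corresponds to a "moderate" $\kappa$ of 0.42 rather than a high one.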