Comparing and Combining Claude, GPT-3.5 and GPT-4 Large Language Models in the Correction of Finnish Learner Texts
Abstract: This paper studies grammatical error correction on challenging authentic Finnish learner texts at CEFR A1 level. Three state-of-the-art large language models are compared, and GPT-4 is shown to outperform GPT-3.5, which in turn outperforms Claude v1 on this task. Additionally, various ensemble models that combine the outputs of multiple single models are evaluated. The best results are obtained by explicitly modeling agreement between single models as a chain of rules in an asymmetric decision tree. Measured as the proportion of fully correct sentences, the best-performing ensemble model obtains an accuracy of 85.7%, whereas the best single model, a GPT-4 model, reaches 82.4%. In other words, the ensemble model reduces the sentence error rate from 17.6% to 14.3%, a relative reduction of 18.8% compared with the best single model.
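The arithmetic behind the reported error reduction, and the general idea of an agreement-based rule chain, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and `agreement_chain` is an assumed example of an asymmetric rule chain, not the specific decision tree evaluated in the paper, whose rules are not given in the abstract.

```python
def sentence_error_rate(accuracy_pct: float) -> float:
    """Sentence error rate in percent, given sentence accuracy in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_acc_pct: float, improved_acc_pct: float) -> float:
    """Relative reduction (in percent) of the sentence error rate."""
    base_err = sentence_error_rate(baseline_acc_pct)
    new_err = sentence_error_rate(improved_acc_pct)
    return 100.0 * (base_err - new_err) / base_err


def agreement_chain(primary: str, second: str, third: str) -> str:
    """Hypothetical asymmetric rule chain over three model outputs.

    Assumption for illustration: `primary` is the output of the strongest
    single model and is trusted unless the other two agree against it.
    """
    # Rule 1: the primary output is confirmed by at least one other model.
    if primary == second or primary == third:
        return primary
    # Rule 2: the two other models agree with each other against the primary.
    if second == third:
        return second
    # Rule 3: no agreement at all, so fall back to the strongest model.
    return primary


if __name__ == "__main__":
    # Accuracies from the abstract: 82.4% (best single) and 85.7% (ensemble).
    print(relative_error_reduction(82.4, 85.7))
```

With the accuracies stated in the abstract, the error rate drops from 17.6% to 14.3%, a relative reduction of about 18.75%, which is consistent with the reported 18.8%.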
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Finnish
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.