Are the Reasoning Models Good at Automated Essay Scoring?

ACL ARR 2025 May Submission331 Authors

11 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: This study investigates the validity and reliability of reasoning models, specifically OpenAI's o3-mini and o4-mini, on automated essay scoring (AES) tasks. We evaluated these models on the TOEFL11 dataset, measuring agreement with expert ratings (validity) and consistency across repeated evaluations (reliability). Our findings reveal two key results: (1) the validity of the reasoning models o3-mini and o4-mini is significantly lower than that of the non-reasoning model GPT-4o mini, and (2) the reliability of the reasoning models cannot be considered high, with Intraclass Correlation Coefficients (ICC) of approximately 0.7 compared to GPT-4o mini's 0.95. These results demonstrate that reasoning models, despite their excellent performance on many benchmarks, do not necessarily perform well on specific tasks such as AES. Additionally, we find that few-shot prompting significantly improves the reasoning models' performance, whereas Chain of Thought (CoT) prompting has less impact.
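As a rough illustration of the two quantities described in the abstract, the sketch below computes an agreement statistic against expert ratings (validity) and an intraclass correlation coefficient over repeated model runs (reliability). The abstract does not state which agreement metric or ICC variant was used, so quadratic weighted kappa and ICC(2,1) are assumed here as common choices for AES evaluation; the score arrays are hypothetical.

```python
# Minimal sketch (assumptions: quadratic weighted kappa for validity,
# two-way random-effects ICC(2,1) for reliability; the paper's exact
# metric choices are not stated in the abstract).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1) for an (essays x repeated-runs) matrix of scores."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)             # between-essay mean square
    ms_c = ss_cols / (k - 1)             # between-run mean square
    ms_e = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical data: expert scores and three repeated model runs per essay.
expert = np.array([4, 3, 5, 2, 4, 3])
model_runs = np.array([
    [4, 3, 4, 2, 5, 3],
    [3, 3, 5, 2, 4, 2],
    [4, 2, 4, 3, 4, 3],
]).T  # shape: (essays, runs)

# Validity: agreement between one model run and the expert ratings.
qwk = cohen_kappa_score(expert, model_runs[:, 0], weights="quadratic")
# Reliability: consistency of scores across the repeated runs.
icc = icc_2_1(model_runs.astype(float))
print(f"QWK vs. experts: {qwk:.2f}, ICC(2,1) across runs: {icc:.2f}")
```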
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of datasets, evaluation methodologies, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 331