Are the Reasoning Models Good at Automated Essay Scoring?

ACL ARR 2025 February Submission5030 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: This study investigates the performance of reasoning models (OpenAI’s o1-mini and o3-mini) in automated essay scoring (AES) tasks. While these models demonstrate superior performance across various benchmarks, their effectiveness in AES applications remains unexplored. We conducted two experiments using the TOEFL11 dataset: (1) examining scoring consistency by having models evaluate identical essays 50 times, and (2) comparing their scoring accuracy against human expert assessments using Quadratic Weighted Kappa (QWK). Our results reveal that conventional models such as GPT-4o mini outperform the newer reasoning models in AES tasks, achieving significantly higher QWK scores (0.619 vs. 0.454 and 0.442). Additionally, the reasoning models assign inconsistent scores when evaluating the same essay repeatedly. These findings suggest that benchmark performance improvements may not translate directly to specialized tasks like essay evaluation, highlighting the importance of task-specific assessment when selecting models for practical applications.
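A minimal sketch of how the two evaluations described in the abstract might be computed, assuming integer essay scores from human raters and a model; the use of scikit-learn's cohen_kappa_score with quadratic weights and all variable names and data values are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): computing Quadratic Weighted
# Kappa (QWK) between human and model scores, plus a simple repeat-scoring
# consistency check, assuming integer essay scores on a fixed scale.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: human expert scores and model scores for the same essays.
human_scores = np.array([3, 4, 5, 2, 4, 3, 5, 4])
model_scores = np.array([3, 4, 4, 2, 5, 3, 5, 3])

# QWK is Cohen's kappa with quadratic weights, the agreement metric used here.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")

# Consistency check: score the same essay repeatedly (e.g., 50 times) and
# inspect the spread of the scores the model assigns.
repeated_scores = np.array([4, 4, 3, 4, 5, 4, 4, 3])  # placeholder for 50 runs
print(f"mean={repeated_scores.mean():.2f}, std={repeated_scores.std():.2f}, "
      f"distinct scores={sorted(set(repeated_scores.tolist()))}")
```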
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of dataset, evaluation methodologies, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 5030