Rank-Then-Score: Toward Language-Generalizable Automated Essay Scoring with Large Language Models

ACL ARR 2025 May Submission7772 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, AES for Chinese remains far less developed than for English. In this paper, we introduce HSK, a Chinese AES benchmark, and propose Rank-Then-Score (RTS), an LLM fine-tuning framework designed to enhance scoring capability, especially on Chinese data. Specifically, we fine-tune a ranking model (Ranker) on feature-enriched data and then feed its output, in the form of a candidate score set, together with the essay content into a scoring model (Scorer) to produce the final score. Experimental results on both Chinese and English datasets demonstrate that RTS consistently outperforms vanilla fine-tuning in average Quadratic Weighted Kappa across all LLMs and datasets, and achieves the best performance on Chinese essay scoring on HSK.
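As a rough illustration of the two-stage pipeline described in the abstract, the sketch below shows how a Ranker's candidate score set could be passed to a Scorer, with evaluation by Quadratic Weighted Kappa. The `ranker.rank` and `scorer.score` interfaces and the feature dictionary are hypothetical placeholders, not the authors' implementation; only the QWK computation via scikit-learn's `cohen_kappa_score` is a standard, verifiable call.

```python
from sklearn.metrics import cohen_kappa_score


def rts_score(essay: str, ranker, scorer, features: dict) -> int:
    """Illustrative Rank-Then-Score inference (interfaces are assumed, not from the paper)."""
    # Stage 1: the Ranker sees the essay plus extracted features and
    # returns a small candidate score set.
    candidate_scores = ranker.rank(essay=essay, features=features)

    # Stage 2: the Scorer selects the final score from the candidates,
    # conditioned on the essay content.
    return scorer.score(essay=essay, candidates=candidate_scores)


def qwk(human_scores, model_scores) -> float:
    """Quadratic Weighted Kappa, the evaluation metric used in the paper."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")
```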
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, essay scoring
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English
Submission Number: 7772