Abstract: Automated Essay Scoring (AES) plays a crucial role in offering immediate feedback, reducing the grading workload of educators, and improving students’ learning experiences. With strong generalization capabilities, large language models (LLMs) offer a new perspective on AES. While previous research has primarily focused on deep learning architectures and models such as BERT for feature extraction and scoring, the potential of LLMs in Chinese AES remains largely unexplored. In this paper, we explore the capabilities of LLMs in Chinese AES. We investigate the effectiveness of well-established LLMs for this task, e.g., the GPT series by OpenAI and Qwen-1.8B by Alibaba Cloud. We construct a Chinese essay dataset with carefully developed rubrics and collect grades from human raters based on these rubrics. We then prompt the LLMs, specifically GPT-4, fine-tuned GPT-3.5, and Qwen, to produce grades, adopting different strategies for prompt generation and model fine-tuning. Comparisons between the grades assigned by the LLMs and those assigned by human raters suggest that the prompt-generation strategy has a marked impact on grade agreement between LLMs and humans. When model fine-tuning is applied, the consistency between LLM scores and human scores improves further. Comparative experimental results demonstrate that fine-tuned GPT-3.5 and Qwen outperform BERT in terms of Quadratic Weighted Kappa (QWK). These results highlight the substantial potential of LLMs in Chinese AES and pave the way for further research on integrating LLMs into Chinese AES with varied prompt-generation and fine-tuning strategies.
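As a point of reference for the agreement metric mentioned above, the following is a minimal sketch of how QWK between LLM-assigned and human-assigned grades can be computed with scikit-learn's cohen_kappa_score; the score arrays are hypothetical and are not drawn from the paper's dataset.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer essay grades for illustration only.
human_scores = [3, 4, 2, 5, 3, 4, 1, 4]   # grades from a human rater
llm_scores   = [3, 4, 3, 5, 2, 4, 1, 5]   # grades from an LLM

# Quadratic Weighted Kappa: Cohen's kappa with a quadratic penalty,
# so disagreements farther apart on the score scale are penalized more.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```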