Abstract: Automated Essay Scoring (AES) holds significant promise in the field of education, helping educators to mark larger volumes of essays and provide timely feedback. However, Arabic AES research has been limited by the lack of publicly available essay data. This study introduces
AR-AES, an Arabic AES benchmark dataset comprising 2046 undergraduate essays, including gender information, scores, and transparent rubric-based evaluation guidelines, providing comprehensive insights into the scoring process. These essays come from four diverse courses, covering both traditional and online exams. Additionally, we pioneer the use of AraBERT for AES, exploring its performance on different question types. We find encouraging results, particularly for Environmental Chemistry and source-dependent essay questions. For the first time, we examine the scale of errors made by a BERT-based AES system, observing that 96.15% of the errors are within one point of the first human marker's score, on a scale of one to five, with 79.49% of predictions matching exactly. In contrast, additional human markers did not exceed 30% exact matches with the first marker, with 62.9% of their scores within one mark. These findings highlight the subjectivity inherent in essay grading and the potential for current AES technology to assist human markers in grading consistently across large classes.
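The error-scale comparison above reduces to simple agreement rates between two score vectors on the one-to-five scale. The sketch below is an illustration only, not the authors' released code; the function name and the toy score lists are assumptions. It shows how exact-match and within-one-point percentages of the kind reported in the abstract can be computed.

```python
import numpy as np

def agreement_stats(predicted, reference):
    """Exact-match and within-one-point agreement between two sets of
    scores on the same 1-5 scale (e.g., model predictions vs. the first
    human marker). Hypothetical helper, not from the AR-AES codebase."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    diff = np.abs(predicted - reference)
    exact = np.mean(diff == 0)       # proportion of identical scores
    within_one = np.mean(diff <= 1)  # proportion within one mark
    return exact, within_one

# Illustrative toy data only -- not drawn from the AR-AES dataset.
model_scores = [3, 4, 2, 5, 3, 4]
marker1_scores = [3, 5, 2, 5, 2, 4]
exact, within_one = agreement_stats(model_scores, marker1_scores)
print(f"exact: {exact:.2%}, within one point: {within_one:.2%}")
```

The same routine applies to comparing a second human marker against the first, which is how the model's agreement can be contrasted with inter-marker agreement.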
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Automated Essay Scoring (AES), Dataset, Arabic, AraBERT
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Arabic, English
Submission Number: 164