WrAFT: A Modular Large Language Model-Powered Automated Writing Evaluation System for Argumentative Essays
Abstract: This study presents WrAFT, a Writing Assessment and Feedback Tool that delivers accurate, reliable scores and effective, comprehensive feedback for argumentative essays. WrAFT adopts a modular design, dividing the automated writing evaluation (AWE) task into scoring, surface-level feedback, and deep-level feedback modules. In building the system, we evaluated various large language models (LLMs), including LLaMA-3.3-70B-Instruct, GPT-4o, and Claude 3.7, through both direct prompting and supervised fine-tuning. An exclusive dataset of 480 TOEFL Independent Writing essays with official benchmark scores was used. Our evaluation shows that WrAFT achieves state-of-the-art scoring performance, with a quadratic weighted kappa (QWK) of 0.84 and a root mean square error (RMSE) of 0.44 against benchmark scores on a 0-5 scale. System-generated feedback also received high approval ratings from human evaluators (96.14% for surface-level feedback, 93.03% for deep-level macro feedback, and 94.69% for deep-level micro feedback). An interactive user interface has been developed for the system and is publicly available free of charge.
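For readers unfamiliar with the reported metrics, the following is a minimal sketch (not taken from the paper) of how QWK and RMSE between predicted and benchmark essay scores on a 0-5 scale are typically computed with scikit-learn; the variable names and example scores are hypothetical.

```python
# Illustrative sketch: computing QWK and RMSE for essay scoring evaluation.
# The score lists below are hypothetical placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_squared_error

benchmark_scores = [4, 3, 5, 2, 4, 3]   # official benchmark scores (0-5)
predicted_scores = [4, 3, 4, 2, 5, 3]   # scores produced by the system

# Quadratic weighted kappa: agreement measure that penalizes larger disagreements more heavily
qwk = cohen_kappa_score(benchmark_scores, predicted_scores, weights="quadratic")

# Root mean square error between predicted and benchmark scores
rmse = np.sqrt(mean_squared_error(benchmark_scores, predicted_scores))

print(f"QWK:  {qwk:.2f}")
print(f"RMSE: {rmse:.2f}")
```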
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, essay scoring
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1806