Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

ACL ARR 2025 May Submission3949 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Automated Essay Scoring (AES) systems now attain near-human agreement on public benchmarks, yet real-world adoption—especially in high-stakes examinations—remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs enjoying formal coverage guarantees. Two open-weight large language models—Llama-3 8B and Qwen-2.5 3B—are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90% risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, demonstrating that trustworthy, uncertainty-aware AES is already feasible with mid-sized, open-source LLMs and paving the way for safer human-in-the-loop marking.
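The conformal-prediction wrapper described in the abstract can be illustrated with a minimal split-conformal sketch. This is not the paper's implementation: the data is synthetic, and all names (`cal_probs`, `cal_labels`, `alpha`, `prediction_set`) are hypothetical. It shows only the general mechanism of calibrating a threshold on held-out nonconformity scores so that prediction sets cover the true label at the target rate (here 90%, matching the abstract's risk level).

```python
import numpy as np

# Minimal split conformal prediction sketch for a K-class scorer
# (e.g. discrete essay score bands). Synthetic data; illustrative only.
rng = np.random.default_rng(0)
K = 6                      # hypothetical number of score bands
n_cal = 500                # calibration set size

# Hypothetical calibration data: softmax probabilities and true labels.
cal_labels = rng.integers(0, K, size=n_cal)
cal_probs = rng.dirichlet(np.ones(K), size=n_cal)
# Make the mock classifier informative: boost the true-label probability.
cal_probs[np.arange(n_cal), cal_labels] += 1.0
cal_probs /= cal_probs.sum(axis=1, keepdims=True)

alpha = 0.10               # 90% target coverage, as in the abstract

# Nonconformity score: 1 - probability assigned to the true label.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]

# Conformal quantile with the standard finite-sample correction.
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

def prediction_set(probs):
    """All labels whose nonconformity score falls below the threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)

# A confident prediction yields a small set; an uncertain one a larger set,
# which is what set size (and hence UAcc-style metrics) measures.
test_probs = rng.dirichlet(np.ones(K))
print(prediction_set(test_probs))
```

By construction, the fraction of calibration examples whose true label lands in its prediction set is at least 1 − α, which is the distribution-free coverage guarantee the abstract refers to; set size then quantifies how "concise" the model's uncertainty is.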
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3949