Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

Keno Harada; Lui Yoshida; Takeshi Kojima; Yusuke Iwasawa; Yutaka Matsuo

Published: 18 May 2026, Last Modified: 19 May 2026. Venue: CoNLL 2026 Archival. License: CC BY 4.0

Keywords: rubric development, agreement, essay scoring

TL;DR: We introduce an iterative process where LLMs automatically refine essay scoring rubrics by analyzing their own scoring errors, improving their alignment with human graders.

Abstract: Large Language Models (LLMs) are increasingly used for Automated Essay Scoring (AES), yet the scoring rubrics they rely on are typically designed for human raters and may not be optimal for LLMs. Inspired by the calibration process that human raters undergo before formal scoring, we propose Reflect-and-Revise, an iterative framework that refines scoring rubrics by prompting models to reflect on their own chain-of-thought rationales and score discrepancies with human labels. At each iteration, the model identifies scoring-error patterns from sampled mismatches and revises the rubric accordingly. Experiments on three essay scoring benchmarks (ASAP, ASAP 2.0, and TOEFL11) with three LLMs (GPT-5 mini, Gemini 3 Flash, and Qwen3-Next-80B-A3B-Instruct) demonstrate that our method yields improvements in Quadratic Weighted Kappa (QWK), achieving gains of up to +0.403 over human-authored rubrics. Starting from a minimal seed rubric that specifies only the score scale, our method matches or exceeds expert rubric performance in most dataset-model combinations, indicating that iterative refinement can reduce the manual effort of rubric authoring. Analysis of the refined rubrics reveals that the refinement process introduces explicit procedural structures, such as conditional gating rules and quantitative thresholds, that are absent from human-authored rubrics, highlighting a gap between rubrics designed for human raters and those effective for LLMs.
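For readers unfamiliar with the agreement metric named in the abstract, the following is a minimal sketch of Quadratic Weighted Kappa in plain Python. It follows the standard textbook definition, not code from the paper; the function name, its parameters, and the handling of the degenerate case are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Illustrative QWK between two lists of integer scores.

    Assumes both lists are the same length and that every score lies in
    the inclusive range [min_score, max_score].
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("the score scale needs at least two levels")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    human_counts = Counter(human)          # marginal counts per rater
    model_counts = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: grows with the squared score distance.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Count expected by chance if the two raters were independent.
            weighted_expected += weight * human_counts[i] * model_counts[j] / n

    if weighted_expected == 0:
        # Both raters gave one identical score to every essay. Returning 1.0
        # here is a choice made for this sketch; conventions differ.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement with the human labels, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, a model score that is off by two levels costs four times as much as one that is off by one.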

Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.

Primary Area Selection: Theoretical Analysis and Interpretation of ML Models for NLP

Other Primary Area: Discourse, Pragmatics, and Reasoning

Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript; Yes, for writing code

Data Collection From Human Subjects: No

Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.

Submission Number: 268
