Collaborative Essay Evaluation with Human and Neural Graders Using Item Response Theory Under a Nonequivalent Groups Design
Abstract: In the assessment of essay writing, reliably measuring examinee ability can be difficult because of bias introduced by rater characteristics. To address this, item response theory (IRT) models that incorporate rater characteristic parameters have been proposed. These models estimate examinee ability from scores assigned by multiple raters while accounting for each rater's scoring characteristics, thereby measuring ability more accurately than a simple average of scores. However, a problem arises when different groups of examinees are assessed by distinct sets of raters. In such cases, test linking is required to place the ability estimates of the different examinee groups on a common scale. Traditional test linking methods require administrators to design groups in which either examinees or raters are partially shared, a requirement that is often impractical in real-world assessment settings. To overcome this problem, we introduce a novel linking method that does not rely on common examinees or raters, instead utilizing a recent automated essay scoring (AES) method. Our method not only facilitates test linking but also enables effective collaboration between human raters and AES, which enhances the accuracy of ability measurement.
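To make the abstract's reference to IRT models with rater characteristic parameters concrete, one representative model of this family is the many-facet Rasch model with a rater severity facet. This is only an illustrative sketch of the general approach, not necessarily the parameterization adopted in this paper. Under this model, the probability that rater $r$ assigns score category $k \in \{0, \dots, K\}$ to the essay of examinee $j$ is

\[
P(X_{jr} = k \mid \theta_j)
  = \frac{\exp \sum_{m=1}^{k} \left(\theta_j - \beta_r - d_m\right)}
         {\sum_{l=0}^{K} \exp \sum_{m=1}^{l} \left(\theta_j - \beta_r - d_m\right)},
\qquad \text{with } \sum_{m=1}^{0}(\cdot) \equiv 0,
\]

where $\theta_j$ is the latent ability of examinee $j$, $\beta_r$ is the severity of rater $r$, and $d_m$ is the $m$-th score category threshold. Estimating $\theta_j$ jointly with the rater parameters is what allows scores from raters of differing severity to be adjusted rather than simply averaged; the scale indeterminacy of $\theta$ across separately calibrated groups is what creates the linking problem the abstract describes.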