Abstract: Recent advances in large language models (LLMs) have significantly enhanced the performance of automated essay scoring (AES). However, relying on a single LLM often results in inconsistent evaluations due to its inherent biases. We introduce ADBIAS, a novel multi-agent AES framework designed to systematically identify and mitigate model-specific biases across multiple LLMs—namely, GPT-4o, Claude 3.5 Sonnet, LLaMA 4 Maverick, and Gemini 2.5 Flash. ADBIAS follows a three-stage process: (1) generating trait-level scores and rationales from each LLM, (2) quantifying scoring tendencies using the Many-Facet Rasch Model (MFRM), and (3) producing final scores via a bias-aware Meta-LLM that integrates metadata including bias information. Empirical results on the ASAP and ASAP++ datasets show that ADBIAS improves scoring accuracy (+6.4% QWK) and substantially reduces bias variance (–57.9%) compared to both single-model and ensemble baselines. By incorporating explicit bias modeling and calibrated aggregation, ADBIAS advances the reliability, fairness, and interpretability of LLM-based essay evaluation.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, model bias/unfairness mitigation, essay scoring, interactive and collaborative generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=hGWEypJ4G5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: This paper exclusively addresses model-specific (rater) bias and does not include discussions of potential risks such as misuse, fairness, or environmental impact.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sec. 3 Dataset (L205–207); Table 9
B2 Discuss The License For Artifacts: No
B2 Elaboration: The manuscript and GitHub repository do not yet specify a license or terms of use for the code. (Dataset licenses are described in Sec. 3 Dataset, L221–238.) We plan to release the full code under an open‑source license (e.g., MIT) within two months and will update the paper accordingly in the camera‑ready version.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Sec. 3 Dataset (L239–241)
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The datasets used in this study, ASAP and ASAP++, are pre-existing and publicly available resources. Our paper does not discuss the specific steps taken by the original data providers to check for personally identifiable information (PII) or offensive content, nor does it detail their anonymization methods. As we did not collect these datasets ourselves, we relied on their established use in academic research. We did not perform additional checks for PII or offensive content beyond what may have been done by the original creators.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Sec. 3 Dataset (L214–217); Table 1
B6 Statistics For Data: Yes
B6 Elaboration: Sec. 3 Dataset (L214–217); Table 1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Sec. 4.1 Overview of ADBIAS (L252–275); Sec. 4.2 Model Selection (L277–291)
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We did not perform hyperparameter search or tuning, as our framework uses pretrained LLMs in a zero-shot setting with fixed temperature and prompts. Our deterministic setup is reported in Section 4.1.
C3 Descriptive Statistics: Yes
C3 Elaboration: Table 2; Table 3
C4 Parameters For Packages: No
C4 Elaboration: Our paper specifies the Large Language Models (LLMs) used (GPT-4o, Claude 3.5 Sonnet, LLaMA 4 Maverick, Gemini 2.5 Flash) and their operational settings (e.g., temperature = 0), and outlines the primary methodology (a Many-Facet Rasch Model implemented in PyTorch) and evaluation metrics (Quadratic Weighted Kappa, one-way ANOVA, Levene's test). However, we did not provide specific version numbers for all underlying software packages, nor URLs or citations for the exact implementations of these common metrics and statistical tests, and any modifications to existing libraries are not explicitly discussed. Our publicly available code repository contains the specific implementation details for our framework.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: Sec. 3 Dataset (L209–212)
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: No
D5 Elaboration: Our paper utilizes the pre-existing ASAP and ASAP++ datasets. These datasets were curated by others, and the original data sources do not provide detailed information about the basic demographic and geographic characteristics of the annotator population, nor do they specify whether the data contains protected information. Consequently, our paper does not discuss these aspects and is not accompanied by a data statement describing these characteristics.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We used ChatGPT to assist with grammar correction and rephrasing. All scientific content was authored and verified by the authors.
Author Submission Checklist: yes
Submission Number: 264