SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

ICLR 2026 Conference Submission2147 Authors

04 Sept 2025 (modified: 23 Nov 2025) · CC BY 4.0
Keywords: Automated Assessment, Short Answer Scoring, LLM-as-a-Judge
Abstract: Short Answer Scoring (SAS) is a critical task in automated subjective answer grading, playing an essential role in education, standardized testing, and large-scale assessment systems. However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, often diverge from human judgment, and offer limited transparency in their scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.
Primary Area: datasets and benchmarks
Submission Number: 2147