ALICE: A Large-Scale German Benchmark for Rubric-Based Multi-Dimensional Automatic Short Answer Scoring

ACL ARR 2026 January Submission 9708 Authors

06 Jan 2026 (modified: 20 Mar 2026) · Readers: Everyone · License: CC BY 4.0
Keywords: NLP4Edu, Automatic Short Answer Scoring, Benchmark Evaluation
Abstract: Automatic Short Answer Scoring (ASAS) is central to NLP for Education. However, openly available benchmarks remain scarce, and existing datasets largely assess how well students answer a question directly rather than how well they master underlying concepts (knowledge elements), such as thermal energy, or skills (epistemic activities), such as reasoning or claiming. To address this gap, we introduce \textsc{ALICE}, a large-scale German ASAS dataset comprising three subtasks: (i) learning performance (ALICE-LP), (ii) knowledge elements (ALICE-KE), and (iii) skills (ALICE-SK). We further frame ASAS as a rubric-ranking task and benchmark a range of language models on it, from XLM-RoBERTa to several lightweight LLMs, under various input configurations. Our experiments show that lightweight LLMs used as encoders are particularly effective for rubric-based ASAS. We also investigate which combinations of contextual information (rubrics, prompts, sample solutions) are beneficial for ASAS.
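To illustrate the rubric-ranking framing described in the abstract, the following is a minimal sketch: a student answer is embedded alongside each rubric-level description with a multilingual encoder, and levels are ranked by cosine similarity. The model choice, pooling strategy, and German rubric texts are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): ASAS as rubric ranking.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "xlm-roberta-base"  # multilingual encoder; an assumed baseline choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(texts):
    # Mean-pool last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

# Hypothetical German rubric levels for a thermal-energy item.
rubric = [
    "Nennt thermische Energie, erklärt aber keinen Mechanismus.",
    "Beschreibt die Energieübertragung qualitativ korrekt.",
    "Begründet die Energieübertragung vollständig und schlüssig.",
]
answer = "Die Wärme fließt vom heißen zum kalten Körper, bis ein Ausgleich erreicht ist."

emb = embed([answer] + rubric)
# Cosine similarity between the answer (row 0) and each rubric level.
scores = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:])
ranking = scores.argsort(descending=True)
print("Rubric levels ranked by similarity:", ranking.tolist())
```

In this framing, scoring reduces to selecting (or ranking) the rubric level whose description best matches the answer; a trained system would fine-tune the encoder on such pairs rather than rely on zero-shot similarity as above.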
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; language resources; evaluation; datasets for low-resource languages
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: German, English
Submission Number: 9708