ALICE: A Large-Scale German Benchmark for Rubric-Based Multi-Dimensional Automatic Short Answer Scoring
Keywords: NLP4Edu, Automatic Short Answer Scoring, Benchmark Evaluation
Abstract: Automatic Short Answer Scoring (ASAS) is central to NLP for Education. However, openly available benchmarks remain scarce, and existing datasets largely assess how well students answer a question directly rather than how well they master underlying concepts (knowledge elements), such as thermal energy, or skills (epistemic activities), such as reasoning or claiming. To address this gap, we introduce ALICE, a large-scale German ASAS dataset comprising three subtasks: (i) learning performance (ALICE-LP), (ii) knowledge elements (ALICE-KE), and (iii) skills (ALICE-SK). We further frame ASAS as a rubric-ranking task and benchmark a range of language models, from XLM-RoBERTa to several lightweight LLMs, under various input configurations. Our experiments show that lightweight LLMs used as encoders are particularly effective for rubric-based ASAS. We also investigate which combinations of context information (rubrics, prompts, sample solutions) are beneficial for ASAS.
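The rubric-ranking framing mentioned in the abstract can be sketched as follows: each rubric level is scored against the student answer, and the best-matching level is returned. This is a minimal illustrative sketch, not the paper's actual method; the toy lexical-overlap scorer and the example rubric below are placeholders standing in for a trained encoder (e.g., XLM-RoBERTa or a lightweight LLM used as an encoder).

```python
def score(answer: str, rubric_level: str) -> float:
    # Toy lexical-overlap scorer; a real system would use a trained
    # encoder to score the (answer, rubric level) pair instead.
    a = set(answer.lower().split())
    r = set(rubric_level.lower().split())
    return len(a & r) / max(len(r), 1)

def rank_rubric_levels(answer: str, rubric: dict[str, str]) -> str:
    # Rubric-ranking framing: score the answer against every rubric
    # level description and return the highest-scoring level.
    return max(rubric, key=lambda level: score(answer, rubric[level]))

# Hypothetical rubric for illustration only (not from the dataset).
rubric = {
    "full": "explains thermal energy transfer between objects",
    "partial": "mentions thermal energy without transfer",
    "none": "no mention of thermal energy",
}
print(rank_rubric_levels("heat is thermal energy transfer between objects", rubric))
```

In this framing, context information such as the question prompt or a sample solution could be concatenated to the answer before scoring, which is one way to study the input configurations the abstract refers to.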
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; language resources; evaluation; datasets for low resource languages
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: German, English
Submission Number: 9708