Large Language Models vs. Human Expert: Benchmarking Automated Short Answer Grading on Traditional Chinese Reading Comprehension Examinations Using PIRLS-HK Dataset

ACL ARR 2025 May Submission 1099 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large language models (LLMs) are rapidly finding their way into automated short-answer grading (ASAG) systems, yet we still lack a realistic benchmark for evaluating their reliability on Traditional Chinese K-12 reading-comprehension tasks. Existing ASAG benchmarks either emphasize encyclopaedic or STEM knowledge, are multiple-choice rather than open-response, or target Simplified Chinese or English, leaving Traditional Chinese ASAG under-explored. We address this gap with PIRLS-HK, a dataset distilled from 15 years of Hong Kong Progress in International Reading Literacy Study (PIRLS) materials. The first release contains 2,352 expert-graded question–answer pairs (25 questions, 4 passages) written in Traditional Chinese by 292 fourth-grade students, each accompanied by the official marking scheme. Using PIRLS-HK, we benchmark 11 LLMs under zero-shot and few-shot settings. Performance is measured with Quadratic Weighted Kappa (QWK), Tolerance-Adjusted Accuracy (TAA), and Relative Merit Consensus (RMC). Results show that few-shot mid-sized models (e.g., qwq-32b, BM: 0.674) rival or surpass much larger variants, while full-size models show only marginal gains across prompting modes. Agreement and accuracy with human graders remain modest: the best QWK is 0.383 (deepseek-V3, few-shot) and the highest TAA (τ = 0) is 71.71% (deepseek-r1, zero-shot). These findings indicate that LLMs that excel on mainstream NLP leaderboards may still lack consistency and fairness when confronted with authentic, culturally embedded assessment data. PIRLS-HK provides the first open benchmark for advancing ASAG research in Traditional Chinese; the dataset and code will be released under CC-BY-NC 4.0.
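As an illustrative sketch of the agreement metrics named in the abstract, the snippet below shows one way QWK and TAA might be computed from parallel lists of human and model scores. The function names, the use of scikit-learn, and the TAA definition (|human − model| ≤ τ, inferred from the "tolerance-adjusted" name and the reported τ = 0 setting) are assumptions, not the authors' released code.

```python
# Hedged reconstruction of QWK and TAA; not the authors' released implementation.
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(human_scores, model_scores):
    """Human-model agreement with quadratic weights (standard QWK)."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

def tolerance_adjusted_accuracy(human_scores, model_scores, tau=0):
    """Assumed TAA: fraction of answers where |human - model| <= tau."""
    hits = sum(abs(h - m) <= tau for h, m in zip(human_scores, model_scores))
    return hits / len(human_scores)

# Example with hypothetical 0-2 point rubric scores for five answers.
human = [2, 1, 0, 2, 1]
model = [2, 0, 0, 2, 2]
print(quadratic_weighted_kappa(human, model))     # QWK
print(tolerance_adjusted_accuracy(human, model))  # TAA at tau = 0 (exact match)
```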
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources, NLP datasets, automatic evaluation of datasets, evaluation, datasets for low resource languages, metrics
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Traditional Chinese
Submission Number: 1099