Benchmarking Large Language Models for Diagnosing Students’ Cognitive Skills from Handwritten Math Work

ACL ARR 2026 March Submission 1439 Authors

16 Mar 2026 (modified: 07 Jun 2026), ACL ARR 2026 March Submission, CC BY 4.0
Keywords: Large Language Models in Math; Cognitive Skill Diagnosis; Benchmark Dataset for Educational AI
Abstract: Diagnosing students' cognitive skills from their handwritten math work is essential for personalized learning, as such work captures intermediate reasoning beyond final answers. Yet interpreting it remains labor-intensive for teachers, and automating it with LLMs is far from straightforward. Unlike mathematical problem-solving, where LLMs generate solutions themselves, cognitive diagnosis requires models to infer latent reasoning from student work. To systematically investigate this underexplored task, we construct MathCog, a benchmark dataset of 3,036 diagnostic verdicts across 639 student responses to 110 math problems, annotated by teachers using TIMSS-grounded cognitive skill checklists with evidential strength labels (Evident/Vague). Using MathCog, we evaluate 17 open and proprietary LLMs and find that (1) all models underperform (F1 < 0.521) regardless of capability, and (2) performance degrades sharply under vague evidence. Error analysis reveals systematic failure patterns: models misattribute vague evidence as evident, over-infer from minimal cues, and hallucinate nonexistent evidence. These findings highlight fundamental limitations in LLMs' ability to reason under implicit evidential conditions, with direct implications for evidence-aware, teacher-in-the-loop designs in LLM-based cognitive diagnosis.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP Applications, Human-Centered NLP, Interpretability and Analysis of Models for NLP
Contribution Types: Data resources, Data analysis
Languages Studied: English, Korean
Submission Number: 1439
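
The abstract reports F1 scores and a drop in performance under vague evidence. As a rough illustration of that kind of evidence-stratified scoring, the sketch below computes binary F1 separately for verdicts labeled Evident and Vague. It is not the authors' evaluation code: the record fields (`evidence`, `gold`, `pred`), the binary verdict encoding, and the toy records are all assumptions made for this example, and the abstract does not say how F1 was averaged.

```python
from collections import defaultdict


def f1_score(gold, pred):
    """Binary F1 over parallel lists of booleans (True = skill demonstrated)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_evidence(records):
    """Group verdicts by evidential strength label and compute F1 per group."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        gold, pred = groups[r["evidence"]]
        gold.append(r["gold"])
        pred.append(r["pred"])
    return {label: f1_score(g, p) for label, (g, p) in groups.items()}


# Toy records invented for illustration only (hypothetical schema, not MathCog data).
records = [
    {"evidence": "Evident", "gold": True, "pred": True},
    {"evidence": "Evident", "gold": False, "pred": False},
    {"evidence": "Evident", "gold": True, "pred": True},
    {"evidence": "Vague", "gold": True, "pred": False},
    {"evidence": "Vague", "gold": False, "pred": True},
    {"evidence": "Vague", "gold": True, "pred": True},
]

scores = f1_by_evidence(records)
for label, score in scores.items():
    print(f"{label}: F1 = {score:.3f}")
```

Reporting the two groups separately, rather than one pooled score, is what makes a gap between evident and vague evidence visible.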