Benchmarking Large Language Models for Diagnosing Students’ Cognitive Skills from Handwritten Math Work

ACL ARR 2026 March Submission 1439 Authors

16 Mar 2026 (modified: 07 Jun 2026) · ACL ARR 2026 March Submission · CC BY 4.0

Keywords: Large Language Models in Math; Cognitive Skill Diagnosis; Benchmark Dataset for Educational AI

Abstract: Diagnosing students' cognitive skills from their handwritten math work is essential for personalized learning, as such work captures intermediate reasoning beyond final answers. Yet interpreting it remains labor-intensive for teachers, and automating it with LLMs is far from straightforward. Unlike mathematical problem-solving, where LLMs generate solutions themselves, cognitive diagnosis requires models to infer latent reasoning from student work. To systematically investigate this underexplored task, we construct MathCog, a benchmark dataset of 3,036 diagnostic verdicts across 639 student responses to 110 math problems, annotated by teachers using TIMSS-grounded cognitive skill checklists with evidential strength labels (Evident/Vague). Using MathCog, we evaluate 17 open and proprietary LLMs and find that (1) all models underperform (F1 < 0.521) regardless of capability, and (2) performance degrades sharply under vague evidence. Error analysis reveals systematic failure patterns: models misattribute vague evidence as evident, over-infer from minimal cues, and hallucinate nonexistent evidence. These findings highlight fundamental limitations in LLMs' ability to reason under implicit evidential conditions, with direct implications for evidence-aware, teacher-in-the-loop designs in LLM-based cognitive diagnosis.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: NLP Applications, Human-Centered NLP, Interpretability and Analysis of Models for NLP

Contribution Types: Data resources, Data analysis

Languages Studied: English, Korean

Submission Number: 1439
