Using Large Language Models To Diagnose Math Problem-solving Skills At Scale

Hyoungwook Jin, Yoonsu Kim, Yeon Su Park, Bekzat Tilekbay, Jinho Son, Juho Kim

Published: 09 Jul 2024, Last Modified: 04 Nov 2025, License: CC BY-SA 4.0

Abstract: Personalized feedback, tailored to students' needs and prior knowledge, is essential for fostering mathematical problem-solving skills. However, personalized feedback is often limited to one-to-one tutoring or small classrooms as it requires instructors' in-depth diagnosis of cognitive processes employed in students' answers. We propose a large language model (LLM) pipeline that diagnoses students' problem-solving skills from their answers at scale in elementary school math word problems. Based on prior literature and an interview with a math education expert, we developed PERC, a framework composed of four problem-solving stages that students can follow: Parse, Extract, Retrieve, and Combine. The framework facilitates diagnosis by externalizing students' step-by-step problem-solving processes and allowing our pipeline to analyze each stage individually. Our LLM pipeline diagnoses each stage by (1) generating rubrics and (2) comparing students' answers with the rubrics. We fine-tuned our LLM pipeline with 71 math problem-rubric pairs and 128 problem-answer-grade triplets collected from elementary school students. We evaluated our pipeline's diagnosis accuracy against vanilla GPT-3.5 and vanilla GPT-4 with automatic and expert evaluations. The results showed the potential of our approach in improving the end-to-end diagnosis accuracy of LLMs, and expert evaluation provided specific aspects that should be improved.
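The two-step, per-stage diagnosis described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `diagnose`, `StageDiagnosis`, `toy_generate_rubric`, and `toy_grade` are hypothetical, the pass/fail grade is an assumed simplification, and the two keyword-matching functions are toy stand-ins for what the paper does with fine-tuned LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four PERC stages named in the abstract.
STAGES = ("Parse", "Extract", "Retrieve", "Combine")

# Assumed signatures; in the paper both steps are performed by an LLM.
RubricGenerator = Callable[[str, str], str]  # (problem, stage) -> rubric
AnswerGrader = Callable[[str, str], bool]    # (rubric, answer) -> meets rubric?


@dataclass
class StageDiagnosis:
    stage: str
    rubric: str
    passed: bool  # assumption: a binary grade per stage


def diagnose(
    problem: str,
    answers: Dict[str, str],
    generate_rubric: RubricGenerator,
    grade: AnswerGrader,
) -> List[StageDiagnosis]:
    """Diagnose each stage independently: (1) generate a rubric, (2) compare."""
    results = []
    for stage in STAGES:
        rubric = generate_rubric(problem, stage)
        passed = grade(rubric, answers.get(stage, ""))
        results.append(StageDiagnosis(stage, rubric, passed))
    return results


# Toy stand-ins for the LLM calls (hypothetical, keyword matching only).
TOY_KEYWORDS = {
    "Parse": "how many",
    "Extract": "3 and 4",
    "Retrieve": "addition",
    "Combine": "7",
}


def toy_generate_rubric(problem: str, stage: str) -> str:
    return TOY_KEYWORDS[stage]


def toy_grade(rubric: str, answer: str) -> bool:
    return rubric.lower() in answer.lower()


problem = "Tom has 3 apples and gets 4 more. How many apples does he have?"
student_answers = {
    "Parse": "The question asks how many apples Tom has.",
    "Extract": "The numbers are 3 and 4.",
    "Retrieve": "I should use multiplication.",
    "Combine": "3 x 4 = 12",
}

report = diagnose(problem, student_answers, toy_generate_rubric, toy_grade)
for d in report:
    print(f"{d.stage}: {'ok' if d.passed else 'needs feedback'}")
```

Because each stage is graded on its own, this hypothetical student's error is localized to the Retrieve stage (choosing the wrong operation) rather than reported only as a wrong final answer.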

External IDs: doi:10.1145/3657604.3664697