Take Out Your Calculators: Estimating the Real Difficulty of Math Word Problems with LLM Student Simulations

ACL ARR 2025 May Submission5829 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Math word problems used in testing are usually piloted with human subjects to establish item difficulty and to detect differential item functioning. These pilots are costly, however, creating a need for a less expensive way to evaluate such questions. We show that large language models (LLMs) can, to an extent, serve as a valuable first check, helping test developers measure students' skills on a given subject matter. We do this by prompting LLMs to role-play Below-Basic, Basic, Proficient, and Advanced 4th- and 8th-grade students. We also assign first names to simulate a more realistic classroom, whose aggregate correct/incorrect rate serves as a proxy for question difficulty. We observe that the simulated student scores align reasonably closely with real student success rates, and that individual models contribute different strengths: combining them can, in some cases, improve the correlation over using any single model.
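To make the described setup concrete, below is a minimal sketch of how one might simulate a classroom of LLM-role-played students and use its error rate as a difficulty proxy. All names here are illustrative assumptions: the `call_llm` helper, the prompt wording, the roster of first names, and the naive correctness check are placeholders, not the authors' actual prompts or evaluation pipeline.

```python
# Hypothetical sketch of an LLM "student simulation" for item-difficulty estimation.
import random
from statistics import mean

PROFICIENCY_LEVELS = ["Below-Basic", "Basic", "Proficient", "Advanced"]
FIRST_NAMES = ["Ava", "Liam", "Maya", "Noah"]  # illustrative roster names


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; returns the model's answer text."""
    raise NotImplementedError("wire up your LLM client here")


def simulate_student(problem: str, grade: int, level: str, name: str) -> str:
    """Prompt the model to role-play one student at a given proficiency level."""
    prompt = (
        f"You are {name}, a {grade}th-grade student performing at the {level} "
        f"level in math. Solve the word problem below as that student would, "
        f"showing your work, then give a final numeric answer.\n\n"
        f"Problem: {problem}\nFinal answer:"
    )
    return call_llm(prompt)


def estimate_difficulty(problem: str, answer_key: str, grade: int,
                        n_per_level: int = 5) -> float:
    """Return the simulated classroom's error rate as a proxy for item difficulty."""
    outcomes = []
    for level in PROFICIENCY_LEVELS:
        for _ in range(n_per_level):
            name = random.choice(FIRST_NAMES)
            response = simulate_student(problem, grade, level, name)
            outcomes.append(answer_key in response)  # naive correctness check
    return 1.0 - mean(outcomes)  # higher value = harder item
```

In this sketch, difficulty is simply the fraction of simulated students who answer incorrectly; ensembling several backbone models would amount to averaging (or otherwise combining) the per-model error rates before correlating them with real student success data.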
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Synthetic Students, Item-Response Theory, Education, Math education, Learning science
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 5829