How Hard is Math? Using Quantitative Metrics to Measure LLM Alignment to Human Intuitions of Difficulty
Keywords: LLM, alignment, question difficulty, math benchmark, log probabilities
Abstract: LLMs have grown increasingly powerful in their level of detail and accuracy when solving math problems, demonstrating advanced reasoning capabilities with chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022). In response, researchers have developed more difficult mathematics benchmarks, such as HARP and Omni-MATH, pushing accuracy scores to their limits (Yue et al., 2024; Gao et al., 2025). Often overlooked is that constructing such benchmarks requires some definition of "difficulty," a term with several plausible meanings in mathematics: the total effort and time required to solve a problem; the requisite skills and problem-solving techniques, often acquired through experience; and the uncertainty felt while solving a problem. While we note this broad range of definitions, and the knowledge gap it creates in the literature, this paper focuses on the human-annotated difficulty scores common to modern mathematics benchmarks. These benchmarks draw problems from Olympiad-level competitions, where difficulty is typically annotated by subject matter experts on a scale tied to the recommended or required age of participants (e.g., the AMC-8, AMC-10, and AMC-12 are designed for middle school and high school students in grades 8, 10, and 12 and below, respectively). The effort expended, the mathematical skills used, and the uncertainty felt by a student solving a problem can all serve as measures of a problem's difficulty. Accordingly, while it is standard to measure the accuracy of an LLM as difficulty increases, we instead explore the extent to which quantitative metrics of LLM solutions relate to human difficulty ratings: the number of generated output tokens (effort), whether the problem includes Asymptote language (geometric spatial skills), and the probabilities of generated solutions (uncertainty). Using Pearson correlation and OLS regression, we find that several of these metrics align moderately with human difficulty ratings (more strongly than LLM solution accuracy does) and show that they have some predictive power for a given problem's difficulty rating, supporting recent findings by Plaut et al. (2025) and Luo et al. (2025). However, these metrics do not explain all of the variance, suggesting that further research into new metrics, such as token- and sentence-level semantics, is needed.
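To make the analysis concrete, the following is a minimal sketch (not the authors' code) of the correlation-and-regression pipeline the abstract describes: per-problem solution metrics are compared against human difficulty ratings via Pearson correlation, then combined in an OLS regression. All data below are synthetic placeholders, and the variable names (`num_tokens`, `mean_logprob`, `has_asymptote`, `difficulty`) are illustrative assumptions rather than the paper's actual schema.

```python
# Minimal sketch of the described analysis, using synthetic placeholder data.
import numpy as np
from scipy.stats import pearsonr
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200  # number of benchmark problems (synthetic)

# Hypothetical per-problem quantities; in the paper these would come from
# LLM generations (token counts, token log-probabilities) and the
# benchmark's human difficulty annotations.
difficulty = rng.integers(1, 6, size=n).astype(float)         # human rating, 1-5
num_tokens = 150 * difficulty + rng.normal(0, 80, size=n)      # effort proxy
mean_logprob = -0.1 * difficulty + rng.normal(0, 0.2, size=n)  # uncertainty proxy
has_asymptote = rng.integers(0, 2, size=n).astype(float)       # geometry flag

# Pearson correlation of each metric with human difficulty ratings.
for name, metric in [("num_tokens", num_tokens),
                     ("mean_logprob", mean_logprob),
                     ("has_asymptote", has_asymptote)]:
    r, p = pearsonr(metric, difficulty)
    print(f"{name}: r={r:+.3f}, p={p:.3g}")

# OLS regression: how much difficulty variance do the metrics jointly explain?
X = sm.add_constant(np.column_stack([num_tokens, mean_logprob, has_asymptote]))
model = sm.OLS(difficulty, X).fit()
print(f"R^2 = {model.rsquared:.3f}")
print(model.params)  # intercept followed by one coefficient per metric
```

An R^2 well below 1 in this kind of fit would correspond to the abstract's observation that the metrics leave unexplained variance in the human ratings.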
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, and Aaditya K. Singh. 2024. HARP: A challenging human-annotated math reasoning benchmark. Preprint, arXiv:2412.08819.
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2025. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth International Conference on Learning Representations.
Benjamin Plaut, Nguyen X. Khanh, and Tu Trinh. 2025. Probabilities of chat LLMs are miscalibrated but still predict correctness on multiple-choice Q&A. Preprint, arXiv:2402.13213.
Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, and Yong Wu. 2025. GeoGramBench: Benchmarking the geometric program reasoning in modern LLMs. Preprint, arXiv:2505.17653.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 76