Abstract: Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery along that dimension. An LLM is considered to truly master a problem only when it excels in all inquiries across the 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
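To make the evaluation protocol concrete, the sketch below (not the authors' implementation; see the linked repository for that) illustrates the strict mastery criterion described in the abstract: a problem counts as mastered only if the model passes the inquiry generated for every one of the 9 dimensions. The agent roles and the names `generate_inquiry`, `reference_answer`, and `judge` are hypothetical placeholders standing in for the Inquiry, Reference, and Judge agents.

```python
# Minimal sketch of CogMath-style per-problem evaluation, assuming
# the three agent roles are available as plain callables.
from typing import Callable, Dict, List

# Illustrative dimension labels only; the paper defines 9 dimensions
# spanning problem comprehension, problem solving, and solution summarization.
DIMENSIONS: List[str] = [
    "rephrasing", "irrelevant_information", "numerical_calculation",
    "knowledge", "counterfactual", "step_perturbation",
    "backward_reasoning", "solution_explanation", "answer_verification",
]

def evaluate_problem(
    problem: str,
    model_solve: Callable[[str], str],            # LLM under evaluation
    generate_inquiry: Callable[[str, str], str],  # Inquiry agent (hypothetical name)
    reference_answer: Callable[[str], str],       # Reference agent (hypothetical name)
    judge: Callable[[str, str], bool],            # Judge agent (hypothetical name)
) -> Dict[str, bool]:
    """Return per-dimension pass/fail results for a single problem."""
    results: Dict[str, bool] = {}
    for dim in DIMENSIONS:
        inquiry = generate_inquiry(problem, dim)     # dimension-specific inquiry
        prediction = model_solve(inquiry)            # model's response to the inquiry
        reference = reference_answer(inquiry)        # expected answer for the inquiry
        results[dim] = judge(prediction, reference)  # judged correct or not
    return results

def masters(results: Dict[str, bool]) -> bool:
    # Strict criterion: mastery requires excelling in all 9 dimensions.
    return all(results.values())
```

Aggregating `masters` over a benchmark yields the stricter accuracy that the paper contrasts with coarse answer-only accuracy, which is where the reported 30%-40% overestimation emerges.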
Lay Summary: Large language models (LLMs) have shown impressive performance on a wide range of mathematical reasoning tasks. However, current evaluations only check if the final answer is right or wrong — a rough metric that fails to reflect what the model truly masters.
We introduce CogMath, a new evaluation framework that assesses LLMs' math abilities through the lens of human cognition. Inspired by psychology, CogMath breaks down reasoning into three stages: understanding the problem, solving the problem, and summarizing the solution. Across these stages, we design 9 fine-grained evaluation dimensions, covering aspects like calculation, factual knowledge, and counterfactual reasoning.
When applied to seven representative LLMs, CogMath reveals that their mathematical abilities may be overestimated by 30%–40%. Our results also pinpoint the strengths and weaknesses of each model, offering insights to guide the development of more trustworthy reasoning systems.
Link To Code: https://github.com/Ljyustc/CogMath
Primary Area: General Machine Learning->Evaluation
Keywords: Large Language Models, Human Cognition, Evaluation
Flagged For Ethics Review: true
Submission Number: 9449