Abstract: Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery along that dimension. An LLM is considered to truly master a problem only when it excels in all inquiries across the 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
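To make the evaluation protocol concrete, the sketch below (not the authors' implementation; see the linked repository for that) illustrates the strict mastery criterion described in the abstract: a problem counts as mastered only if the model passes the inquiry generated for every one of the 9 dimensions. The agent roles and the names `generate_inquiry`, `reference_answer`, and `judge` are hypothetical placeholders standing in for the Inquiry, Reference, and Judge agents.

```python
# Minimal sketch of CogMath-style per-problem evaluation, assuming
# the three agent roles are available as plain callables.
from typing import Callable, Dict, List

# Illustrative dimension labels only; the paper defines 9 dimensions
# spanning problem comprehension, problem solving, and solution summarization.
DIMENSIONS: List[str] = [
    "rephrasing", "irrelevant_information", "numerical_calculation",
    "knowledge", "counterfactual", "step_perturbation",
    "backward_reasoning", "solution_explanation", "answer_verification",
]

def evaluate_problem(
    problem: str,
    model_solve: Callable[[str], str],            # LLM under evaluation
    generate_inquiry: Callable[[str, str], str],  # Inquiry agent (hypothetical name)
    reference_answer: Callable[[str], str],       # Reference agent (hypothetical name)
    judge: Callable[[str, str], bool],            # Judge agent (hypothetical name)
) -> Dict[str, bool]:
    """Return per-dimension pass/fail results for a single problem."""
    results: Dict[str, bool] = {}
    for dim in DIMENSIONS:
        inquiry = generate_inquiry(problem, dim)     # dimension-specific inquiry
        prediction = model_solve(inquiry)            # model's response to the inquiry
        reference = reference_answer(inquiry)        # expected answer for the inquiry
        results[dim] = judge(prediction, reference)  # judged correct or not
    return results

def masters(results: Dict[str, bool]) -> bool:
    # Strict criterion: mastery requires excelling in all 9 dimensions.
    return all(results.values())
```

Aggregating `masters` over a benchmark yields the stricter accuracy that the paper contrasts with coarse answer-only accuracy, which is where the reported 30%-40% overestimation emerges.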
Lay Summary: Large language models (LLMs) have shown impressive performance on a wide range of mathematical reasoning tasks. However, current evaluations only check if the final answer is right or wrong — a rough metric that fails to reflect what the model truly masters.
We introduce CogMath, a new evaluation framework that assesses LLMs' math abilities through the lens of human cognition. Inspired by psychology, CogMath breaks down reasoning into three stages: understanding the problem, solving the problem, and summarizing the solution. Across these stages, we design 9 fine-grained evaluation dimensions, covering aspects like calculation, factual knowledge, and counterfactual reasoning.
When applied to seven representative LLMs, CogMath reveals that their mathematical abilities may be overestimated by 30%–40%. Our results also pinpoint the strengths and weaknesses of each model, offering insights to guide the development of more trustworthy reasoning systems.
Link To Code: https://github.com/Ljyustc/CogMath
Primary Area: General Machine Learning->Evaluation
Keywords: Large Language Models, Human Cognition, Evaluation
Flagged For Ethics Review: true
Submission Number: 9449