Abstract: This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, DeepSeek-R1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers via flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and an inability to translate physical intuition into mathematical steps. Manual scrutiny reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite their broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just final answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: LLMs, mathematical reasoning, evaluation, reasoning failure
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2199