The Cliff Effect: Probing the Boundary Between Memorization and Reasoning in LLM Numerical Computation

ACL ARR 2026 January Submission10259 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: large language models, numerical reasoning, arithmetic, memorization, compositional reasoning, benchmark evaluation, LLM evaluation, mathematical reasoning, hallucination, model interpretability
Abstract: Recent studies have documented "accuracy collapse" in Large Language Models on complex reasoning tasks -- planning puzzles, multi-hop questions, mathematical word problems -- fueling debate about whether LLMs can reason or merely pattern-match. However, these tasks conflate numerical computation with spatial reasoning, language parsing, and knowledge retrieval, leaving open whether failures stem from these auxiliary demands or from mathematical processing itself. We address this gap by probing the most fundamental level: can LLMs reliably compose elementary arithmetic operations? Using a carefully constructed benchmark that systematically varies difficulty along controlled dimensions -- operation count, digit magnitude, and precision requirements -- we evaluate four frontier models on 1,152 test cases spanning eight categories of numerical computation. We document a striking "cliff effect": accuracy collapses from 72.6% on easy tasks to 33.0% at medium difficulty -- a drop of nearly 40 percentage points in a single step -- then plateaus through harder levels. Statistical analysis confirms that easy-task performance differs significantly from performance at all other difficulty levels, while the medium, hard, and expert levels do not differ significantly from one another. This step-function pattern -- strong performance followed by sudden collapse rather than gradual degradation -- mirrors findings from puzzle-based evaluations but emerges here at the level of basic arithmetic, suggesting a general architectural limitation rather than a task-specific failure. Error analysis reveals that 81.6% of incorrect answers are rounding issues rather than reasoning failures, though categories requiring exact symbolic manipulation (e.g., fraction addition) show 100% genuine calculation errors.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: mathematical reasoning, numerical computation, arithmetic, LLM evaluation, compositional reasoning, memorization, benchmark
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 10259