Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity

Published: 28 Dec 2025, Last Modified: 08 Mar 2026AAAI 2026 Bridge LMReasoningEveryoneRevisionsBibTeXCC BY 4.0
Keywords: AI Safety, LLM Reasoning, Faithfulness, AI Evaluation
TL;DR: Small reasoning models often look mathematically competent, but our diagnostics show they mainly rely on pattern matching rather than genuine logical computation.
Abstract: Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity—while the model achieves reasonable answer accuracy (70\%+), it demonstrates poor backward consistency (15\%), limited transitivity coverage (32.2\%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
Submission Number: 105
Loading