LLMs Fail to Recognize Mathematical Unsolvability

20 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Mathematical Reasoning, Datasets, Benchmark
TL;DR: We present MathTrap300, a benchmark of unsolvable math problems, on which most LLMs exhibit a clear accuracy drop compared to the corresponding well-posed problems, and we further analyze and categorize their common failure patterns.
Abstract: While modern large language models (LLMs) achieve high accuracy on many challenging math benchmarks, they often struggle to recognize the unsolvability of ill-posed problems, a capability that is trivial for humans. To evaluate this capability, we introduce MathTrap300, a dataset of 300 unsolvable, ill-posed math problems with missing or contradictory conditions, and conduct comprehensive benchmarking. Our first contribution is the construction of the first large-scale, high-quality benchmark of mathematically unsolvable problems, manually derived from well-posed counterparts with rigorous verification of ill-posedness. Our second contribution is a fine-grained, three-stage LLM-judge framework, designed from observations of how LLMs respond to unsolvable problems. It captures signals in both final answers and intermediate reasoning and provides richer metrics, enabling a more faithful assessment of unsolvability recognition. Our third contribution is a benchmark of recent LLMs on MathTrap300, together with a detailed analysis of their response patterns. We observe a clear drop in accuracy from the original well-posed problems to their unsolvable counterparts. Common failure modes are categorized into hallucination, logical flaws, and neglect of conditions, among others. We also observe that even when models recognize unsolvability, they still attempt to force a solution, a form of sycophantic behavior. The dataset and evaluation code will be publicly released.
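The abstract does not specify the internals of the three-stage judge framework, so the following is only a minimal sketch of what such a pipeline could look like: one stage checks the final answer, one scans the intermediate reasoning, and one aggregates both signals into a fine-grained outcome. The helper `call_llm` is a hypothetical placeholder, and the prompts and outcome labels are assumptions, not the paper's actual design.

```python
# Minimal sketch of a three-stage LLM-judge pipeline for unsolvability recognition.
# Assumption: call_llm(prompt) is a hypothetical helper wrapping any chat-completion API;
# the prompts, stage definitions, and outcome labels below are illustrative only.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError


def judge_response(problem: str, response: str) -> dict:
    # Stage 1: does the final answer explicitly declare the problem unsolvable?
    final_says_unsolvable = call_llm(
        f"Problem:\n{problem}\n\nModel response:\n{response}\n\n"
        "Does the FINAL ANSWER state that the problem is unsolvable? Reply yes or no."
    ).strip().lower().startswith("yes")

    # Stage 2: does the intermediate reasoning notice the missing or contradictory condition?
    reasoning_notices_trap = call_llm(
        f"Problem:\n{problem}\n\nModel response:\n{response}\n\n"
        "Does the REASONING identify a missing or contradictory condition? Reply yes or no."
    ).strip().lower().startswith("yes")

    # Stage 3: aggregate both signals into a fine-grained outcome,
    # e.g. the "noticed the trap but still forced an answer" pattern the abstract describes.
    if final_says_unsolvable:
        outcome = "recognized_unsolvable"
    elif reasoning_notices_trap:
        outcome = "noticed_but_forced_answer"
    else:
        outcome = "missed_unsolvability"

    return {
        "final_says_unsolvable": final_says_unsolvable,
        "reasoning_notices_trap": reasoning_notices_trap,
        "outcome": outcome,
    }
```

Separating the final-answer check from the reasoning scan is what lets a judge like this distinguish genuine recognition of unsolvability from the sycophantic pattern in which a model notices the trap but answers anyway.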
Primary Area: datasets and benchmarks
Submission Number: 22714