MathTrap300: A Benchmark for Recognizing Mathematical Insolvability in LLMs

Published: 11 Nov 2025 · Last Modified: 16 Jan 2026 · DAI Poster · CC BY 4.0
Keywords: Large Language Models, Mathematical Reasoning, Datasets, Benchmark
TLDR: We present MathTrap300, a benchmark of 300 insolvable math problems on which most LLMs show a clear accuracy drop relative to the corresponding well-posed problems; we further analyze and categorize their common failure patterns.
Abstract: While modern large language models (LLMs) achieve high accuracy on many challenging mathematical benchmarks, they often fail to recognize that ill-posed problems are insolvable. Existing benchmarks of insolvable problems are either modified from elementary-level math questions or lack rigorous validation of their insolvability; no benchmark yet offers inherently insolvable problems that require deep mathematical knowledge to identify. To address this gap, we introduce MathTrap300, the first benchmark of 300 insolvable, ill-posed math problems whose fundamental mathematical contradictions or missing conditions demand deep domain knowledge to detect. We manually construct these problems from well-posed counterparts through careful modifications, with the ill-posedness of each problem rigorously verified by PhD-level experts. We then present a fine-grained, three-stage LLM-judge framework designed around observed patterns in LLM responses to insolvable problems. The framework captures signals from both final answers and intermediate reasoning, yielding richer metrics and a more faithful assessment of insolvability recognition. Evaluating recent advanced LLMs on MathTrap300 and analyzing their response patterns in detail, we find a clear drop in accuracy from well-posed problems to their insolvable counterparts. Common failure modes include hallucination, guessing, and condition neglect. We also observe that even when models recognize insolvability, they still produce a definitive answer.
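To make the three-stage judge framework concrete, here is a minimal hypothetical sketch of how such a pipeline could combine final-answer and intermediate-reasoning signals into a fine-grained verdict. The stage boundaries, prompts, labels, and the `llm_judge` callable are all assumptions for illustration; the abstract does not specify the paper's actual implementation.

```python
# Hypothetical sketch of a three-stage LLM-judge pipeline for insolvability
# recognition. Stages, prompts, and labels are assumptions, not the paper's
# actual framework.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RECOGNIZED = "recognized_insolvable"              # flags the flaw, gives no final answer
    RECOGNIZED_BUT_ANSWERED = "recognized_but_answered"
    NOT_RECOGNIZED = "not_recognized"                 # hallucination / guessing / condition neglect


@dataclass
class JudgeResult:
    final_answer_given: bool
    reasoning_flags_insolvability: bool
    verdict: Verdict


def judge(response: str, llm_judge) -> JudgeResult:
    """Judge one model response to an insolvable problem in three stages.

    `llm_judge` is any callable taking a prompt string and returning
    "yes" or "no"; the prompts below are illustrative placeholders.
    """
    # Stage 1: does the response commit to a definitive final answer?
    final_answer_given = llm_judge(
        "Does this response end with a definitive final answer? "
        f"Reply yes or no.\n\n{response}"
    ) == "yes"

    # Stage 2: does the intermediate reasoning flag a contradiction
    # or a missing condition in the problem statement?
    reasoning_flags = llm_judge(
        "Does the reasoning in this response point out that the problem "
        f"is contradictory or underspecified? Reply yes or no.\n\n{response}"
    ) == "yes"

    # Stage 3: combine both signals into a fine-grained verdict.
    if reasoning_flags and not final_answer_given:
        verdict = Verdict.RECOGNIZED
    elif reasoning_flags:
        # The failure mode noted in the abstract: the model sees the flaw
        # yet still outputs a definitive answer.
        verdict = Verdict.RECOGNIZED_BUT_ANSWERED
    else:
        verdict = Verdict.NOT_RECOGNIZED

    return JudgeResult(final_answer_given, reasoning_flags, verdict)
```

Separating the answer-level and reasoning-level checks is what lets a judge of this shape distinguish genuine recognition from the "recognized but still answered" pattern the abstract highlights, which a final-answer-only metric would miss.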
Submission Number: 47