PROBE: Benchmarking Reasoning Paradigm Overfitting in Large Language Models

ICLR 2026 Conference Submission 25298 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Benchmark Evaluation
Abstract: The reliability of reasoning benchmarks for Large Language Models (LLMs) is threatened by overfitting, which inflates scores and misrepresents true capability. Existing benchmarks focus on surface-level perturbations and therefore fail to detect a more profound form of overfitting, in which models memorize problem-specific reasoning paradigms rather than developing generalizable and flexible logical skills. To address this, we introduce PROBE (Paradigm-ReOriented Benchmark for overfitting Evaluation), a benchmark designed to systematically assess this limitation. PROBE introduces variants that force a shift in the core reasoning paradigm, such as simplification, introduced unsolvability, or a change in the fundamental solution approach, alongside conventional transformations. Our evaluation of state-of-the-art LLMs on PROBE reveals significant reasoning paradigm overfitting: while models achieve an average accuracy of 81.57% on original problems, their performance drops substantially to 63.18% on PROBE, with a strikingly low score of 35.08% on the most challenging Unsolvability type. Our work highlights the need for benchmarks that probe deeper into reasoning generalization and provides a tool for fostering more robust LLMs.
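As an illustrative sketch (not taken from the paper), the headline numbers above amount to an accuracy gap between original problems and their paradigm-shifted variants; the record fields ("original_correct", "variant_correct", "variant_type") below are hypothetical names chosen for the example.

```python
from collections import defaultdict

def accuracy_gap(results):
    """Compute original accuracy, variant accuracy, their gap, and
    per-variant-type accuracy (e.g. to isolate an Unsolvability subset).

    results: list of dicts with boolean 'original_correct' and
    'variant_correct' flags plus a string 'variant_type' label.
    """
    n = len(results)
    orig_acc = sum(r["original_correct"] for r in results) / n
    var_acc = sum(r["variant_correct"] for r in results) / n

    # Group variant outcomes by their paradigm-shift type.
    by_type = defaultdict(list)
    for r in results:
        by_type[r["variant_type"]].append(r["variant_correct"])
    type_acc = {t: sum(v) / len(v) for t, v in by_type.items()}

    return orig_acc, var_acc, orig_acc - var_acc, type_acc
```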
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25298