Detecting generalization deficits in large language and reasoning models by using natural variations in simple problems
Abstract: Large language and reasoning models (LLMs, LRMs) are instances of foundation models that exhibit scaling laws predicting improved generalization with increasing pre-training scale. As such, they are supposed to possess strong generalization and therefore to transfer robustly across various tasks and conditions in a few-shot or zero-shot manner. Such claims rely on various standardized benchmarks that are meant to measure core functions like generalization and reasoning, and on which state-of-the-art (SOTA) models score highly. We demonstrate a remarkable zero-shot generalization deficit in most SOTA models claiming strong function, including reasoning models like DeepSeek R1 or o1-mini trained at the largest scales, using a simple, short common-sense math problem formulated in concise natural language and easily solvable by humans, which we term the Alice in Wonderland (AIW) problem. The deficit manifests as strong performance fluctuations across natural variations of the simple problem template that change neither the problem structure nor its difficulty. By testing models on further control problems of similar form, we rule out that the deficit is rooted in minor low-level issues like natural-language or number parsing. In conventional LLMs, we observe strong overconfidence in wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. Many models showing the deficit also collapse to close to 0 accuracy on AIW problems while still exhibiting high scores on various standardized benchmarks. We show how this illusion of strong function might be caused by leakage of test sets into training. For reasoning models, while we observe clearly improved performance compared to LLMs, we still see strong fluctuations across problem variations that keep structure and difficulty unchanged. Our observations suggest that current LLMs and LRMs possess generalization deficits that can be detected by controlled, structure- and difficulty-preserving variations of simple problems, in contrast to standardized benchmarks, which contain problems of higher difficulty yet fail to detect such clear deficits. Code for reproducing the experiments in the paper and the raw experiment data can be found at https://anonymous.4open.science/r/AITW_anonymous-69A6
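For illustration, the sketch below shows how structure- and difficulty-preserving variations of an AIW-style problem could be generated together with their ground-truth answers. The template wording, protagonist names, and number ranges are assumptions made for this sketch only and need not match the exact prompts or evaluation pipeline used in the paper.

```python
# Minimal illustrative sketch (not the paper's evaluation code): generate natural
# variations of an AIW-style problem that change only surface details (name,
# numbers) while leaving problem structure and difficulty unchanged.
import itertools

# Assumed AIW-style template; the exact wording used in the paper may differ.
TEMPLATE = ("{name} has {n_brothers} brothers and she also has {n_sisters} sisters. "
            "How many sisters does {name}'s brother have?")

def ground_truth(n_sisters: int) -> int:
    # Each brother has all of the protagonist's sisters plus the (female)
    # protagonist herself as sisters.
    return n_sisters + 1

def aiw_variations(names=("Alice", "Sofia", "Nina"), counts=range(1, 6)):
    """Yield (prompt, correct_answer) pairs differing only in surface details."""
    for name, n_b, n_s in itertools.product(names, counts, counts):
        prompt = TEMPLATE.format(name=name, n_brothers=n_b, n_sisters=n_s)
        yield prompt, ground_truth(n_s)

if __name__ == "__main__":
    for prompt, answer in list(aiw_variations())[:3]:
        print(prompt, "-> expected answer:", answer)
```

In this reading, it is the fluctuation of accuracy across such variations, rather than the score on any single instance, that signals the generalization deficit described above.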
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Re-focusing the text on how natural variations of simple problems can be used to detect generalization deficits in both LLMs and LRMs that are overlooked by standardized benchmarks.
- Changing the title to reflect the re-focus
- Re-writing the introduction and parts of the discussion and conclusion
- Improving the description of the control experiments
- Providing a new table to show the discrepancy between scores on the reasoning benchmarks MATH-500, AIME24, and GPQA-Diamond and performance on AIW problems
- Shortening long sentences
- Adding a discussion of Shojaee et al. (2025) to emphasize novelty and point out differences from our work
Assigned Action Editor: ~Elahe_Arani1
Submission Number: 5541