Alice in Wonderland: Variations in Simple Problems Reveal Severe Generalization Deficits in Large Language and Reasoning Models
Abstract: Large language and reasoning models (LLMs, LRMs) are instances of foundation models exhibiting scaling laws that predict improving generalization with increasing pre-training scale. As such, they are supposed to possess strong generalization and therefore to transfer robustly across various tasks and conditions in a few-shot or zero-shot manner. Such claims rely on various standardized benchmarks that should measure core functions like generalization and reasoning, on which state-of-the-art (SOTA) models score highly. Here we demonstrate a severe breakdown of zero-shot generalization in most SOTA models that claim strong function, including reasoning models like DeepSeek R1 or o1-mini trained at the largest scales, using a simple, short common-sense problem formulated in concise natural language and easily solvable by humans (the AIW problem). The breakdown is severe: even on this simple problem it manifests both in low average performance and, importantly, in strong performance fluctuations across natural variations of the problem template that change neither the problem structure nor its difficulty. By testing models on further control problems of similar form, we rule out that the breakdown is rooted in minor low-level issues such as natural language or number parsing. In conventional LLMs, we observe strong overconfidence in wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. We use these observations to stimulate re-assessment of the capabilities of the current generation of LLMs and LRMs as claimed by standardized language understanding and reasoning benchmarks. Such re-assessment also requires common action to establish benchmarks that properly detect deficits in generalization and reasoning which remain undiscovered by current evaluation procedures, where models with clear deficits still manage to score high. We discuss how this illusion might be caused by leakage of test sets into training data, and how procedural test problem generation can alleviate it. Code for reproducing the experiments in the paper and raw experiment data can be found at https://anonymous.4open.science/r/AITW_anonymous-69A6
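The abstract does not spell out the AIW template or the variation procedure; as a rough illustration of what procedural test problem generation could look like, here is a minimal sketch in Python. The template wording and parameter ranges below are assumptions for illustration, not the authors' actual generator.

```python
import random

# Assumed template in the spirit of the AIW problem described in the abstract;
# the exact wording and number ranges used in the paper may differ.
TEMPLATE = ("Alice has {n_brothers} brothers and she also has {n_sisters} sisters. "
            "How many sisters does Alice's brother have?")

def generate_aiw_variations(num_variations: int, seed: int = 0):
    """Procedurally generate AIW instances by varying only the numbers,
    which changes neither the problem structure nor its difficulty."""
    rng = random.Random(seed)
    problems = []
    for _ in range(num_variations):
        n_brothers = rng.randint(1, 6)
        n_sisters = rng.randint(1, 6)
        prompt = TEMPLATE.format(n_brothers=n_brothers, n_sisters=n_sisters)
        # Correct answer: Alice's sisters plus Alice herself.
        answer = n_sisters + 1
        problems.append({"prompt": prompt, "answer": answer})
    return problems

if __name__ == "__main__":
    for p in generate_aiw_variations(3):
        print(p["prompt"], "->", p["answer"])
```

Because instances are generated on the fly rather than drawn from a fixed published set, such procedural generation makes it harder for test items to leak into training data.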
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Elahe_Arani1
Submission Number: 5541