Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out-of-Distribution Generalization

TMLR Paper6172 Authors

10 Oct 2025 (modified: 22 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Recent progress has pushed AI frontiers from pattern-recognition tasks toward problems that require step-by-step, System-2-style reasoning, especially with large language models. Yet, unlike learning, where the concepts of generalization and out-of-distribution (OoD) evaluation are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out-of-Distribution (Complexity OoD) generalization as a framework and problem setting for defining and measuring reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, whether representational (richer solution structure) or computational (more reasoning steps/longer programs), exceeds that of all training examples. We formalize complexity via the Kolmogorov complexity of solution descriptions and via operational proxies (e.g., object/relation counts; reasoning-step counts), and clarify how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System-1-like processing at low complexity become System-2-like under complexity pressure, while System 2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark design and evaluation metrics; rethinking supervision to target solution traces (from final outcomes to process-level feedback and RL/search); seeking and designing inductive biases for Complexity OoD generalization; and addressing learning-to-reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step-wise calibration. In light of recent controversies over LLM reasoning, we put the problem on firm footing: treating reasoning as Complexity OoD enables rigorous evaluation and more systematic research.
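A minimal sketch of the definition, paraphrasing the abstract (the threshold form and notation below are ours, not necessarily the paper's exact formulation): a test instance $(x^\ast, y^\ast)$ is Complexity OoD with respect to a training set $\mathcal{D}_{\text{train}}$ if
\[
K(x^\ast) > \max_{(x,y)\in\mathcal{D}_{\text{train}}} K(x)
\quad\text{or}\quad
K(y^\ast \mid x^\ast) > \max_{(x,y)\in\mathcal{D}_{\text{train}}} K(y \mid x),
\]
where $K(x)$ proxies representational complexity, $K(y \mid x)$ proxies the computational complexity of the minimal solution program, and Complexity OoD generalization means performance is maintained on such instances.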
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=SPPGnJr51F
Changes Since Last Submission: Over the past year, motivated by high-profile findings (such as the "Illusion of Thinking" paper) showing that measuring LLM reasoning is non-trivial, we reframed the paper around Complexity Out-of-Distribution (Complexity OoD) as the evaluation lens, aiming to enable more systematic research on reasoning. Key changes since the last TMLR submission:
- Reframed the paper as a position/survey with a clear scope. We now explicitly separate "problems/domains and evaluation" from "architectures and specific solutions," per prior feedback.
- Provided a more precise and formal definition of Complexity OoD using Kolmogorov complexity, distinguishing representational complexity (via K(x)) from computational complexity (via K(y|x)).
- Clarified "semantic/atomic units": we define primitives as task-level building blocks that can be distributed (not necessarily symbolic), and treat "solutions" as programs composed from these primitives.
- Added complexity-binned analyses and experiments on reasoning benchmarks (GSM8K, AIME, Omni-MATH), showing skewed complexity distributions and accuracy degradation as complexity increases, and highlighting differences between "reasoning-tuned" and general-purpose models, supporting our central claim that complexity matters (an illustrative sketch of such a binned analysis follows this summary).
- Conducted a systematic literature review from four angles:
  - Representational complexity (e.g., object-centric learning, emergent communication),
  - Computational complexity (e.g., adaptive depth/halting, program synthesis),
  - Advances in LLMs as a bridge between System 1 and System 2 for addressing Complexity OoD (CoT, search at inference, PRMs/ORMs, RL),
  - Recent work on evaluating LLM reasoning with complexity-conditioned protocols.
- Editorial and structural improvements: the paper is substantially reorganized and streamlined relative to the prior submission.
- Expanded recommendations for future research with Complexity OoD in mind across four fronts: tasks and benchmarks, supervision paradigms, inductive biases, and newly framed problem settings.

This submission is a ground-up rewrite. We reconstruct the problem definition, unify disparate threads under the Complexity OoD lens, and operationalize evaluation via complexity-aware analyses and metrics. Given the surge of claims about LLM reasoning and mounting evidence of contamination and metric brittleness, a principled Complexity OoD framework is both timely and necessary. We believe these revisions provide a stronger, clearer foundation for rigorous, systematic research on reasoning.
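A minimal sketch of the kind of complexity-binned accuracy analysis mentioned above, assuming per-example records that carry a complexity proxy (e.g., an annotated reasoning-step count) and a correctness flag; the field names, bin width, and data below are illustrative, not the paper's actual pipeline.

```python
# Illustrative complexity-binned accuracy analysis (not the paper's actual code).
# Assumes each record has a complexity proxy (e.g., reasoning-step count) and a
# boolean indicating whether the model's answer was correct.
from collections import defaultdict

def binned_accuracy(records, bin_width=2):
    """Group examples into complexity bins and report accuracy per bin.

    records: iterable of dicts like {"complexity": int, "correct": bool}
    bin_width: number of complexity units (e.g., reasoning steps) per bin
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        b = r["complexity"] // bin_width
        totals[b] += 1
        hits[b] += int(r["correct"])
    return {
        (b * bin_width, (b + 1) * bin_width - 1): hits[b] / totals[b]
        for b in sorted(totals)
    }

# Toy example: the question of interest is whether accuracy degrades
# as the complexity bin increases.
records = [
    {"complexity": 2, "correct": True},
    {"complexity": 3, "correct": True},
    {"complexity": 7, "correct": False},
    {"complexity": 8, "correct": True},
    {"complexity": 13, "correct": False},
]
print(binned_accuracy(records))  # {(2, 3): 1.0, (6, 7): 0.0, (8, 9): 1.0, (12, 13): 0.0}
```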
Assigned Action Editor: ~Kevin_Swersky1
Submission Number: 6172