Abstract: We investigate how large language models (LLMs) generalize to math problems requiring extended reasoning. Specifically, we assess LLM performance on GSM8K-like problems that require an increasing number of operations to solve, and conduct a detailed analysis of the models' responses. Our findings reveal a significant decline in accuracy as problem complexity increases, even as thinking effort (measured by both token length and number of reasoning steps) naturally grows. Additionally, we explore whether more complex Chain-of-Thought (CoT) prompts can enhance generalization and examine the performance of reasoning models (i.e., o1-like models that generate long CoTs). Although reasoning models show better length generalization, the decline persists, highlighting inherent constraints in current LLMs' ability to learn from shorter, simpler examples and apply that knowledge to longer, more complex tasks.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Generalization, Reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5027