Abstract: We investigate how large language models (LLMs) generalize to math problems requiring extended reasoning. Specifically, we assess LLM performance on GSM8K-like problems that require an increasing number of operations to solve, and conduct a detailed analysis of the models' responses. Our findings reveal a significant decline in accuracy as problem complexity increases, even as thinking effort (measured by both token length and number of reasoning steps) naturally grows. Additionally, we explore whether more complex Chain-of-Thought (CoT) prompts can enhance generalization and examine the performance of reasoning models (i.e., o1-like models that generate long CoTs). Although reasoning models show better length generalization, the decline persists, highlighting inherent constraints in current LLMs' ability to learn from shorter, simpler examples and apply that knowledge to longer, more complex tasks.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Generalization, Reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5027