Keywords: Large language model, Reasoning model, Long chain-of-thought
TL;DR: We demonstrate that reasoning length, not problem difficulty, plays the dominant role in training reasoning models, and introduce the Long1K dataset and Long1K-32B model, which achieve state-of-the-art results with only 1,000 training samples.
Abstract: Difficult problems, which often produce longer reasoning traces, are widely regarded as key drivers for enhancing the performance of reasoning models. In this work, we challenge this assumption by disentangling problem difficulty from reasoning length, and demonstrate that reasoning length itself plays the dominant role. We introduce a simple yet effective method for synthetically constructing long-chain reasoning data without requiring inherently challenging tasks, yielding the Long1K dataset of only 1,000 training samples. Fine-tuning on Long1K produces Long1K-32B, which achieves state-of-the-art results on benchmarks such as MATH500 (95.6%) and GPQA Diamond (71.1%), outperforming models trained on vastly larger datasets, including DeepSeek-R1-Distill-Qwen-32B, by 1.3% and 9% respectively. Further analysis shows that longer reasoning sequences promote more structured reasoning, improve long-range instruction following, and achieve superior scaling efficiency compared to inference-only strategies. Our findings establish reasoning length as a critical and independent scaling axis for enhancing the reasoning capabilities of large language models. The model, code, and dataset are all open-sourced, available at https://anonymous.4open.science/r/LONG1k-32B.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12853