Abstract: Reasoning models have demonstrated remarkable performance on complex tasks by generating long reasoning traces prior to producing final answers. However, previous research on long-context scaling in language models has generally focused on managing lengthy input prompts rather than producing long outputs. To leverage the strong long-context understanding abilities of current models, we introduce Understanding-to-Reasoning Transition (URT) fine-tuning, a sequence-level curriculum learning framework that gradually shifts a model's focus from interpreting long chains of thought to generating them. By incorporating partial reasoning steps in the input context, URT naturally exposes the model to diverse prompt lengths during training, preserving its performance on long-context comprehension while developing advanced reasoning capabilities. Experiments on rigorous reasoning benchmarks, including AIME24 and GPQA Diamond, reveal that our approach surpasses standard fine-tuning by over 10%, while maintaining robust performance on the understanding tasks in RULER.
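The abstract describes a sequence-level curriculum that moves partial reasoning steps from the input context into the generation target over the course of training. Below is a minimal, hypothetical sketch of how such data construction could look; the function names (`make_urt_example`, `curriculum_schedule`) and the linear schedule are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def make_urt_example(question: str,
                     reasoning_steps: List[str],
                     answer: str,
                     keep_in_context: float) -> Tuple[str, str]:
    """Put the first `keep_in_context` fraction of reasoning steps into the
    prompt (understanding side) and leave the rest, plus the answer, as the
    generation target (reasoning side)."""
    cut = int(len(reasoning_steps) * keep_in_context)
    prompt = question + "\n" + "\n".join(reasoning_steps[:cut])
    target = "\n".join(reasoning_steps[cut:] + [answer])
    return prompt, target


def curriculum_schedule(step: int, total_steps: int) -> float:
    """Linearly decay the fraction of reasoning kept in the prompt from 1.0
    (mostly long-context understanding) to 0.0 (full reasoning generation).
    The linear decay is an assumption for illustration."""
    return max(0.0, 1.0 - step / total_steps)


if __name__ == "__main__":
    steps = [f"Step {i}: ..." for i in range(1, 11)]
    for t in (0, 500, 1000):
        ratio = curriculum_schedule(t, 1000)
        prompt, target = make_urt_example("Q: ...", steps, "Answer: 42", ratio)
        print(f"t={t}: prompt has {len(prompt)} chars, target has {len(target)} chars")
```

As the schedule advances, prompts shrink and targets grow, which would expose the model to diverse prompt lengths while shifting its objective from interpreting long chains of thought to generating them.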
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: chain-of-thought, fine-tuning
Languages Studied: English
Submission Number: 3561