How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM, test-time scaling
Abstract: Recent breakthroughs in large language models (LLMs) have markedly advanced reasoning through two broad post-training paradigms: supervised fine-tuning (SFT) and reinforcement learning (RL), particularly on mathematical and logical problems with verifiable answers. Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, it remains poorly understood how much backtracking contributes to reasoning improvements and what the optimal extent of its use is. In this work, we systematically investigate the interplay between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings show that short CoT sequences used as an SFT warm-up make a moderate contribution to RL training relative to RL without any warm-up; however, this contribution diminishes as tasks become increasingly difficult. Motivated by these observations, we introduce a backtracking-centric training recipe. By synthetically varying the number of explicit backtracking steps in the SFT warm-up, we show that (i) longer CoTs containing backtracks stabilize and amplify RL, and (ii) the optimal backtrack depth scales with task difficulty: zero for Arc 1D, one for Countdown, and five for Sudoku, yielding up to a 28.9\% absolute accuracy boost at the 3B parameter scale. Collectively, our controlled experiments provide concrete guidance for constructing training mixtures that reliably push LLM reasoning beyond current boundaries.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9858